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Preface 



This book contains the presentations given at the Workshop on OpenMP Appli- 
cations and Tools, WOMPAT 2001. The workshop was held on July 30 and 31, 
2001 at Purdue University, West Lafayette, Indiana, USA. It brought together 
designers, users, and researchers of the OpenMP application programming inter- 
face. OpenMP has emerged as the standard for shared memory parallel program- 
ming. For the first time, it is possible to write parallel programs that are portable 
across the majority of shared memory parallel computers. WOMPAT 2001 ser- 
ved as a forum for all those interested in OpenMP and allowed them to meet, 
share ideas and experiences, and discuss the latest developments of OpenMP and 
its applications. WOMPAT 2001 was co-sponsored by the OpenMP Architecture 
Review Board (ARB). It followed a series of workshops on OpenMP, including 
WOMPAT 2000, EWOMP 2000, and WOMPEI 2000. 

For WOMPAT 2001, we solicited papers formally and published them in 
the form of this book. The authors submitted extended abstracts, which were 
reviewed by the program committee. All submitted papers were accepted. The 
authors were asked to prepare a final paper in which they addressed the reviewers 
comments. The proceedings, in the form of this book, were created in time to 
be available at the workshop. In this way, we hope to have brought out a timely 
report of ongoing OpenMP-related research and development efforts as well as 
ideas for future improvements. 

The workshop program included the presentations of the 15 papers in this 
book, two keynote talks, a panel discussion, and the founding meeting of an 
OpenMP users’ group. The keynote talks were given by David Padua, University 
of Illinois, entitled “OpenMP and the Evolution of Parallel Programming” , and 
by Larry Meadows, Sun Microsystems, entitled “State of the OpenMP ARB”, 
respectively. The panel was entitled “OpenMP Beyond Shared Memory” . 

As WOMPAT 2001 was being prepared, the next OpenMP workshop had 
already been announced, called EWOMP 2001, to be held in Barcelona, Spain. 
This only adds to the evidence that OpenMP has become a true standard for 
parallel programming, is very much alive, and is of interest to an increasingly 
large community. 
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Abstract. We present a new benchmark suite for parallel comput- 
ers. SPEComp targets mid-size parallel servers. It includes a num- 
ber of science/engineering and data processing applications. Parallelism 
is expressed in the OpenMP API. The suite includes two data sets, 
Medium and Large, of approximately 1.6 and 4 GB in size. Our overview 
also describes the organization developing SPEComp, issues in creating 
OpenMP parallel benchmarks, the benchmarking methodology underly- 
ing SPEComp, and basic performance characteristics. 



1 Introduction 

Parallel program execution schemes have emerged as a general, widely-used com- 
puter systems technology, which is no longer reserved for just supercomputers 
and special purpose hardware sytems. Desktop and server platforms offer mul- 
tithreaded execution modes in today’s off-the shelf products. The presence of 
parallelism in mainstream computer systems necessitates development of ade- 
quate yardsticks for measuring and comparing such platforms in a fair manner. 

Gurrently, no adequate yardsticks exist. Over the past decade, several com- 
puter benchmarks have taken aim at parallel machines. The SPLASH [7| bench- 
marks were used by the research community, but have not been updated recently 
to represent current computer applications. Similarly, the Perfect Benchmarks j2j 
used to measure high-performance computer systems at the beginning of the 
90es. They included standard, sequential programs, which the benchmarker had 
to transform for execution on a parallel machine. The Parkbench effort 0 was 
an attempt to create a comprehensive parallel benchmark suite at several sys- 
tem levels. However, the effort is no longer ongoing. The SPEGhpc suite m is 
a currently maintained benchmark for high-performance computer systems. It 
includes large-scale computational applications. 

In contrast to these efforts, the goal of the present work is to provide a 
benchmark suite that 

— is portable across mid-range parallel computer platforms. 
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— can be run with relative ease and moderate resources, 

— represents modern parallel computer applications, and 

— addresses scientific, industrial, and customer benchmarking needs. 

Additional motivation for creating such a benchmark suite was the fact that 
the OpenMP API (www.openmp.org) has emerged as a de-facto standard for 
expressing parallel programs. OpenMP naturally offers itself for expressing the 
parallelism in a portable application suite. The initiative to create such a suite 
was made under the auspices of the Standard Performance Evaluation Corpo- 
ration (SPEC, www.spec.org), which has been developing various benchmark 
suites over the last decade. The new suite is referred to at SPEComp, with the 
first release being SPEComp2001. 

The SPEC organization includes three main groups, the Open Systems Group 
(OSG), best known for its recent SPEC CPU2000 benchmark, the Graphics 
Performance Characterization Group (CPC), and the High Performance Group 
(HPG) . The SPEComp initiative was taken by the HPG group, which also devel- 
ops and maintains the SPEChpc suite. Current members and affiliates of SPEC 
HPG are Compaq Computer Corp., Fujitsu America, Intel Corp., Sun Microsys- 
tems, Silicon Graphics Inc., Argonne National Lab, Leibniz- Rechenzentrum, Na- 
tional Cheng King University, NCSA/University of Illinois, Purdue University, 
Real World Computing Partnership, the University of Minnesota, Tsukuba Ad- 
vanced Computing Center, and the University of Tennessee. In contrast to the 
SPEChpc suite, we wanted to create smaller, more easily portable and executable 
benchmarks, targeted at mid-range parallel computer systems. Because of the 
availability of and experience with the SPEC CPU2000 suite, we decided to 
start with these applications. Where feasible, we converted the codes to parallel 
form. With one exception, all applications in SPEComp2001 are derived from 
the CPU2000 suite. We also increased the data set significantly. The first release 
of the suite includes the Medium data set, which requires a computer system 
with 2GB of memory. An even larger data set is planned for a future release. An- 
other important difference to the SPEC CPU2000 benchmarks is the run rules, 
which are discussed in section E21 

The remainder of the paper is organized as follows. Section El gives an 
overview of the benchmark applications. Section El presents issues we faced in 
developing the OpenMP benchmark suite. Section 0 discusses basic SPEComp 
performance characteristics. 



2 Overview of the SPEComp Benchmarks 

2.1 The SPEComp2001 Suite 

SPEComp is fashioned after the SPEC CPU2000 benchmarks. Unlike the SPEC 
CPU2000 suite, which is split into integer and floating-point applications, 
SPEComp2001 is partitioned into a Medium and a Large data set. 

The Medium data set is for moderate size SMP (Shared- memory Multipro- 
cessor) systems of about 10 CPUs. The Large data set is oriented to systems with 
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Table 1. Overview of the SPEComp2001 Benchmarks 



Benchmark name 


Applications 


Language 


lines 


ammp 


Chemistry /biology 


C 


13500 


applu 


Fluid dynamics/physics 


Fortran 


4000 


apsi 


Air pollution 


Fortran 


7500 


art 


Image Recognition/neural networks 


C 


1300 


facerec 


Face recognition 


Fortran 


2400 


fmaSd 


Crash simulation 


Fortran 


60000 


gafort 


Genetic algorithm 


Fortran 


1500 


galgel 


Fluid dynamics 


Fortran 


15300 


equake 


Earthquake modeling 


C 


1500 


mgrid 


Multigrid solver 


Fortran 


500 


swim 


Shallow water modeling 


Fortran 


400 


wupwise 


Quantum chromodynamics 


Fortran 


2200 



30 CPUs or more. The Medium data sets have a maximum memory requirement 
of 1.6 GB for a single CPU, and the Large data sets require up to 6 GB. Run 
times tend to be a bit long for people used to running SPEC CPU benchmarks. 
Single CPU times can exceed 10 hours for a single benchmark on a single state- 
of-the-art processor. Of the twelve SPEComp2001 applications, nine codes are 
written in Fortran and three are written in C. TableQ shows basic features of the 
benchmarks. The suite includes several large, complex modeling and simulation 
programs of the type used in many engineering and research organizations. The 
application areas include chemistry, mechanical engineering, climate modeling, 
physics, image processing, and decision optimization. 

2.2 SPEComp Benchmarking Methodology 

The overall objective of the SPEComp benchmark suite is the same as that 
of most benchmarks: to provide the user community with a tool to perform 
objective series of tests. The test results serve as a common reference in the 
evaluation process of computer systems and their components. 

SPEComp provides benchmarks in the form of source code, which are com- 
piled according to a specific set of rules. It is expected that a tester can obtain a 
copy of the suite, install the hardware, compilers, and other software described 
in another tester’s result disclosure, and reproduce the claimed performance 
(within a small range to allow for run-to-run variation). 

We are aware of the importance of optimizations in producing the best system 
performance. We are also aware that it is sometimes hard to draw an exact line 
between legitimate optimizations that happen to benefit the SPEComp bench- 
marks and optimizations that specifically target these benchmarks. However, 
with the list below, we want to increase awareness among implementors and end 
users towards the issues related to unwanted benchmark-specific optimizations. 
Such optimizations would be incompatible with the goal of fair benchmarking. 

The goals of the SPEComp suite are to provide a reliable measurement of 
SMP system performance, and also to provide benchmarks where new technol- 
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ogy, pertinent to OpenMP performance, can be evaluated. For these reasons, 
SPEC allows limited source code modifications, even though it possibly compro- 
mises the objectivity of the benchmark results. We will maintain this objectivity 
by implementing a thorough review process. 

To ensure that results are relevant to end-users, we expect that the hard- 
ware and software implementations used for running the SPEComp benchmarks 
adhere to the following conventions: 

— Hardware and software used to run the SPEComp benchmarks must provide 
a suitable environment for running typical C and FORTRAN programs. 

— Optimizations must generate correct code for a class of programs, where 
the class of programs must be larger than a single SPEComp benchmark or 
SPEComp benchmark suite. This also applies to assertion flags and source 
code modifications that may be used for peak measurements. 

— Optimizations must improve performance for a class of programs where the 
class of programs must be larger than a single SPEComp benchmark or 
SPEComp benchmark suite. 

— The vendor encourages the implementation for general use. 

— The implementation is generally available, documented and supported by 
the providing vendor. 

Benchmarking results may be submitted for a base, and optionally, a peak run. 
The base run is compiled and executed with a single set of compiler flags, and no 
source code modification is permitted. For the peak run, separate compiler flags 
may be used for each program, and limited source code modifications, restricted 
to the optimization of parallel performance, are permitted. 

3 Issues in Developing an OpenMP Benchmark Suite 

Several issues had to be resolved in developing the SPEComp suite. These issues 
include transforming the original codes to OpenMP, resolving portability prob- 
lems, defining new data sets, creating self- validation code for each benchmark, 
and developing benchmark run tool. 

3.1 Transforming Sequential Applications to OpenMP 

A major effort in creating SPEComp was to convert the original, sequential 
programs into OpenMP parallel form. We give brief descriptions of the major 
transformation steps in creating OpenMP parallel programs. 

To analyze parallelism in the given codes, we used a mix of application-level 
knowledge and program- level analysis. In several codes we started by manually 
inlining subroutine calls for easier interprocedural analysis. We then identified 
variables that are defined within potential parallel regions, and variables that 
are “live out” (e.g. formal parameters in a subroutine or function call, function 
return value, and COMMON blocks). 

We declared as PRIVATE scalar variables that are defined in each loop iteration 
and that are not live out. There were a few instances of LASTPRIVATE variables. 
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We then identified any scalar and array reductions. Scalar reductions were placed 
on OpenMP REDUCTION clauses. Array reductions were transformed by hand 
(Note, that the OpenMP 2.0 specification supports array reductions). 

ammp: ammp was by far the most difficult program to parallelize. In addition to 
directives, we added 16 calls to various OpenMP subroutines. There were only 
13 pragmas added, but extensive revisions of the source base were necessary to 
accommodate parallelism. A few hundred lines of source code were moved or 
modified to get acceptable scalability. Key to getting scalability was the organi- 
zation of the program using vector lists instead of linked lists. 

applu: We added a total of 50 directives, with almost all simply a form of 
PARALLEL or DO. A NDWAIT clause was added for the terminator of one loop 
to improve scalability. 

apsi: The main issue in transforming apsi was to privatize arrays that are sec- 
tions of larger, shared arrays. We did this by declaring separate arrays in the 
subroutine scope. The size of the arrays is derived from the input parameters 
(MAX(nx,ny,nz)). Another way to do this is to ALLOCATE the arrays inside OMP 
PARALLEL but before OMP DO. Several simple induction variable also had to be 
substituted in this code. In performing these transformations we followed the 
parallelization scheme of the same code in [S| . 

art: For art, one difficulty encountered in defining a large data set with reason- 
able execution time involved the use of the -objects switch. When the -objects N 
switch is invoked {N is an integer specifying the number of objects to simulate), 
the neural network simulates the learning of several more objects than actually 
trained upon, art’s memory requirements scale quadratically with the number of 
learned objects. Unfortunately, the execution time also scales quadratically due 
to the F2 layer retry on a mismatch. The art 2 algorithm essentially tests every 
learned object against the current inputs in a prioritized manner (more probable 
matches are tested first). To reduce the execution time, the number of F2 layer 
retries was limited to a small constant value. The reduction still allowed the real 
objects to be found. 

facerec: In this code, pictures in an album are compared against a probe gallery. 
Both the generation of the album graphs, and the comparison of the album 
to the probe gallery are parallelized. The parallelization of the probe gallery 
comparison, which takes up almost all of the computation time, is done on the 
probe picture level. This is a ’’shared nothing” parallelization, which is achieved 
by copying the album graphs into a private array. This method performs well on 
systems with nonuniform memory access. There still remain many opportunities 
for parallelism inside the photo gallery comparison code. These opportunities 
could be exploited using nested parallelism. It would facilitate scaling to larger 
numbers of processors. 
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fma3d: fmaSd is the largest and most complex code in the suite. There are over 
60,000 lines of code, in which we added 127 lines of OpenMP directives. Nearly 
all directives were of the PARALLEL or DO variety. There were about a dozen 
THREADPRIVATE directives for a common block, ten reductions, a critical section, 
and a number of NOWAIT clauses. Still, locating the place for these directives was 
not too difficult and resulted in reasonable scalability. 

gafort: We applied three major transformations. First, the “main Generation 

loop” was restructured so that it is private. It enabled parallelization of the 
“Shuffle” loop outside this major loop. Inlining of two subroutines expanded this 
main loop, reducing a large number of function calls. The second transformation 
concerns the “Random Number Generator” . It was parallelized without changing 
the sequential algorithm. Gare was taken to reduce false-sharing among the state 
variables. Third, the “Shuffle” loop was parallelized using OpenMP locks. The 
parallel shuffle algorithm differs from the original sequential algorithm. It can 
lead to different responses for different parallel executions. While this method 
leads to better scalability and is valid from the application point of view, it made 
the implementation of the benchmark validation procedure more difficult. 

galgel: This code required a bit more attention to parallelization than most, 
because even some smaller loops in the LAPAGK modules need OpenMP direc- 
tives, and their need became apparent only when a significant number of GPUs 
were used. A total of 53 directives were added with most being simple PARALLEL 
or DO constructs. A total of three NDWAIT clauses were added to aid in scalability 
by permitting work in adjacent loops to be overlapped. 

equake: One of the main loops in this code was parallelized at the cost of sub- 
stantial additional memory allocation. It is an example of memory versus speed 
tradeoff. Three small loops in the main timestep loop were fused to create a 
larger loop body and reduce the Fork- Join overhead proportionally. The most 
time-consuming loop (function smvp) was parallel, but needed the transforma- 
tion of array reductions. The initialization of the arrays was also parallelized. 

mgrid: This code was parallelized using an automatic translator. The code is 
over 99% parallel. The only manual improvement was to avoid using multiple 
DMP PARALLEL/END PARALLEL constructs for consecutive parallel loops with no 
serial section in-between. 

swim: In this code we added parallel directives to parallelize 8 loops, with a 
reduction directive needed for one loop. Swim achieves nearly ideal scaling for 
up to 32 GPUs. 

wupwise: The transformation of this code to OpenMP was relatively straight- 
forward. We first added directives to the matrix vector multiply routines for 
basic OpenMP parallel do in the outer loop. We then added OpenMP directives 
to the LAPAGK routines (dznrm2 . f zaxpy.f zcopy.f zdotc.f zscal.fj.We 
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also inserted a critical section to dznrm2 . f for a scaling section. Reduction di- 
rectives were needed for zdotc.f and zscal.f. After these transformations, 
wupwise achieved almost perfect scaling on some SMP systems. 

3.2 Defining Data Sets and Running Time 

An important part of defining a computer benchmark is the selection of appro- 
priate data sets. The benchmark input data has to create an adequate load on 
the resources of the anticipated test machines, but must also reflect a realistic 
problem in the benchmark’s application domain. In our work, these demands 
were not easy to reconcile. While we could identify input parameters of most 
codes that directly affect the execution time and working sets (e.g., time steps 
and array sizes), it was not always acceptable to modify these parameters indi- 
vidually. Instead, with the help of the benchmark authors, we developed input 
data sets that correspond to realistic application problems. Compared to the 
SPEC CPU2000 benchmarks, SPEComp includes significantly larger data sets. 
This was considered adequate, given the target machine class of parallel servers. 
The split into a Medium (< 2GB) and a Large (< 8GB) data set intends to 
accommodate different machine configurations, but also provides a benchmark 
that can test machines supporting a 64 bit address space. 

3.3 Issues in Benchmark Self- Validation 

An important part of a good benchmark is the self validation step. It indicates 
to the benchmarkers that they have not exploited overly aggressive compiler op- 
tions, machine features, or code changes. In the process of creating SPEComp 
we had to resolve several validation issues, which went beyond those arising in 
sequential benchmarks. One issue is that numerical results tend to become less 
accurate when computing in parallel. This problem arises not only in the well- 
understood parallel reductions. We have also observed that advanced compiler 
optimizations may lead to expression reorderings that invalidate a benchmark 
run (i.e., the output exceeds the accuracy tolerance set by the benchmark de- 
veloper) on larger numbers of processors. Another issue arose in benchmarks 
that use random number generators. The answer of such a program may de- 
pend on the number of processors used, making a validation by comparing with 
the output of a sequential run impossible. To address these issues, we found 
benchmark-specific solutions, such as using double-precision and identifying fea- 
tures in the program output that are invariant of the number of processors. 

3.4 Benchmark Run Tools 

Creating a valid benchmark result takes many more steps than just running 
a code. Procedural mistakes in selecting data sets, applying compilation and 
execution options, and validating the benchmark can lead to incorrect bench- 
mark reports on otherwise correct programs. Tools that support and automate 
this process are an essential part of the SPEC benchmarks. The run tools for 
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SPEComp2001 were derived from the SPEC CPU2000 suite. Typically, bench- 
markers modify only a small configuration file, in which their machine-dependent 
parameters are defined. The tools then allows one to make (compile and link) 
the benchmark, run it, and generate a report that is consistent with the run 
rules described in Section |^1 



3.5 Portability Across Platforms 

All major machine vendors have participated in the development of 
SPEComp2001. Achieving portability across all involved platforms was an im- 
portant concern in the development process. The goal was to achieve functional 
portability as well as performance portability. Functional portability ensured 
that the makefiles and run tools worked properly on all systems and that the 
benchmarks ran and validated consistently. To achieve performance portabil- 
ity we accommodated several requests by individual participants to add small 
code modifications that take advantage of key features of their machines. It was 
important in these situations to tradeoff machine-specific issues against perfor- 
mance properties that hold generally. An example of such a performance feature 
is the allocation of lock variables in gafort. The allocation of locks was moved 
into a parallel region, which leads to better performance on systems that provide 
non-uniform memory access times. 



4 Basic SPEComp Performance Characteristics 

We present basic performance characteristics of the SPEComp2001 applica- 
tions. Note, that these measurements do not represent any SPEC benchmark 
results. All official SPEC benchmark reports will be posted on SPEC’s Web 
pages (www.spec.org/hpg/omp). The measurements were taken before the fi- 
nal SPEComp2001 suite was approved by SPEC. Minor modifications to the 
benchmarks are still expected before the final release. Hence, the given numbers 
represent performance trends only, indicating the runtimes and scalability that 
users of the benchmarks can expect. As an underlying machine we give a generic 
platform of which we know clock rate and number of processors. 

Figure 0 shows measurements taken on 2, 4 and 8 processors of a 350MHz 
machine. The numbers indicate the time one has to expect for running the 
benchmark suite. They also show scalability trends of the suite. Table 0 shows 
the parallel coverage for each code, which is the percentage of serial execution 
time that is enclosed by a parallel region. This value is very high for all appli- 
cations, meaning that these codes are thoroughly parallelized. Accordingly, the 
theoretical “speedup by Amdahl’s Law” is near-perfect on 8 processors, as shown 
in the third column. Column four shows the “Fork-Join” overhead, which is com- 
puted as the {tfj ■ N) /toveraii, where toveraii is the overall execution time of the 
code, N is the dynamic number of invocations of parallel regions, and tfj is the 
Fork-Join overhead for a single parallel region. We have chosen tfj = 10-|-p-2 /is, 
where p is the number of processors. This is a typical value we have observed. 
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Table 2. Characteristics of the parallel execution of SPEComp2001 



Benchmark 


Parallel 

Coverage 


Amdahl’s 
Speedup (8 CPU) 


% Fork-Join 
Overhead (8 CPU) 


Number of 
Parallel Sections 


ammp 


99.2 


7.5 


0.0008336 


7 


applu 


99.9 


7.9 


0.0005485 


22 


apsi 


99.9 


7.9 


0.0019043 


24 


art 


99.5 


7.7 


0.0000037 


3 


equake 


98.4 


7.2 


0.0146010 


11 


facerec 


99.9 


7.9 


0.0000006 


3/2^ 


fmaSd 


99.5 


7.7 


0.0052906 


92/30^ 


gafort 


99.9 


7.9 


0.0014555 


6 


galgel 


96.8 


6.5 


4.7228800 


32/29^ 


mgrid 


99.9 


7.9 


0.1834500 


12 


swim 


99.5 


7.7 


0.0041672 


8 


wupwise 


99.8 


7.9 


0.0036620 


6 



^ static sections / sections called at runtime 



The table shows that the Fork- Join overhead is very small for all benchmarks, 
except for galgel. It indicates that, in all but one codes, the overhead associated 
with OpenMP constructs is not a factor limiting the scalability. Column five 
shows the static number of parallel regions for each code. A detailed analysis of 
the performance of SPEComp2001 can be found in p. 



5 Conclusions 

We have presented a new benchmark suite for parallel computers, called 
SPEComp. We have briefly described the organization developing the suite as 
well as the development effort itself. Overall, the effort to turn the originally se- 
quential benchmarks into OpenMP parallel codes was modest. All benchmarks 
are parallelized to a high degree, resulting in good scalability. 

SPEComp is the first benchmark suite for modern, parallel servers that 
is portable across a wide range of platforms. The availability of OpenMP 
as a portable API was an important enabler. The first release of the suite, 
SPEComp2001, includes a Medium size data set, requiring a machine with 2 GB 
of memory. While the codes have been tuned to some degree, many further 
performance optimizations can be exploited. We expect that the availability of 
SPEComp will encourage its users to develop and report such optimizations. 
This will not only lead to improved future releases of the suite, it will also show 
the value of the new benchmarks as a catalyst for parallel computing technology. 



Acknowledgement. We’d like to acknowledge the important contributions of 
the many authors of the individual benchmark applications. The creation of 
SPEComp2001 would not have been possible without them. 
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Fig. 1. Execution times of the SPEComp2001 benchmarks on 2,4, and 8 processors of 
a generic 350MHz machine. 
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Abstract. The recent parallel language standard for shared mem- 
ory multiprocessor (SMP) machines, OpenMP, promises a simple and 
portable interface for programmers who wish to exploit parallelism ex- 
plicitly. In this paper, we present our effort to develop portable compil- 
ers for the OpenMP parallel directive language. Our compiler consists 
of two parts. Part one is an OpenMP parallelizer, which transforms se- 
quential languages into OpenMP. Part two transforms programs writ- 
ten in OpenMP into thread-based form and links with our runtime li- 
brary. Both compilers are built on the Polaris compiler infrastructure. 
We present performance measurements showing that our compiler yields 
results comparable to those of commercial OpenMP compilers. Our in- 
frastructure is freely available with the intent to enable research projects 
on OpenMP-related language development and compiler techniques. 



1 Introduction 

Computer systems that offer multiple processors for executing a single applica- 
tion are becoming common place from servers to desktops. While programming 
interfaces for such machines are still evolving, significant progress has been made 
with the recent definition of the OpenMP API, which has established itself as 
a new parallel language standard. In this paper, we present a research compiler 
infrastructure for experimenting with OpenMP. 

Typically, one of two methods is used for developing a shared-memory paral- 
lel program: Users may take advantage of a restructuring compiler to parallelize 
existing uniprocessor programs, or they may use an explicitly parallel language 
to express the parallelism in the application. OpenMP is relevant for both sce- 
narios. First, OpenMP can be the target language of a parallelizing compiler that 
transforms standard sequential languages into parallel form. In this way, a par- 
allelizer can transform programs that port to all platforms supporting OpenMP. 
A first contribution of this paper is to describe such a portable parallelizer. Sec- 
ond, programs written in OpenMP can be compiled for a variety of computer 
systems, providing portability across these platforms. Our second contribution 

* This work was supported in part by NSF grants #9703180-CCR and #9872516-EIA. 

** The author is now with Kuck & Associates Software, A Division of Intel Americas, 
Inc., Champaign, IL 61820. 
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is to present such an OpenMP compiler, called PCOMP (Portable Compilers for 
OpenMP), which is publicly available. 

Both compilers are built on the Polaris compiler infrastructure P] , a Fortran 
program analysis and manipulation system. The public availability of PCOMP 
and its runtime libraries makes it possible for the research community to conduct 
such experiments as the study of new OpenMP language constructs, detailed 
performance analysis of OpenMP runtime libraries, and the study of internal 
compiler organizations for OpenMP programs. In addition, this paper makes 
the following specific contributions: 

— We describe an OpenMP postpass to the Polaris parallelizing compiler, which 
generates parallel code in OpenMP form. 

— We describe a compiler that translates OpenMP parallel programs into 
thread-based form. The thread-based code can be compiled by a conven- 
tional, sequential compiler and linked with our runtime libraries. 

— We present performance measurements showing that PCOMP yields results 
comparable to the commercial OpenMP parallelizers. 

We know of two related efforts to provide a portable OpenMP compiler to the 
research community. The Omni OpenMP compiler P| translates C and Fortran 
programs with OpenMP pragmas into C code suitable for compiling with a na- 
tive compiler linked with the Omni OpenMP runtime library. Also the compiler 
provides a cluster-enabled OpenMP implementation on a page-based software 
distributed shared memory. The OdinMP/CCp (C to C with pthreads) |3] is 
also a portable compiler for C with OpenMP to C with POSIX thread libraries. 
Both efforts are related to our second contribution. 

In Section 121 we present an overview of our OpenMP compiler system. Sec- 
tion 0 and Section 0 describe the details of our OpenMP parallelizer, which 
generates parallel code with OpenMP directives, and the OpenMP compiler, 
which translates OpenMP parallel code into thread-based form. In Section we 
measure the performance of PCOMP and other OpenMP compilers. Section El 
concludes the paper. 

2 Portable OpenMP Compilers 

Figure Q gives an overview of our OpenMP compiler system. A program can 
take two different paths through our OpenMP compiler system: (1) A serial 
program is analyzed for parallelism by the Polaris Analysis passes and then 
annotated with OpenMP directives by the OpenMP postpass. The output is 
an OpenMP parallel program. (2) OpenMP parallel programs can be processed 
by the OpenMP directive parser and then fed to the MOERAE postpass El? 
which transforms them into thread-based code. In this scenario, no additional 
parallelism is recognized by the Polaris Analysis passes. The generated code can 
be compiled by a sequential compiler and linked with our MOERAE microtask 
library. 
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Fig. 1. Overview of our OpenMP compiler system. The same infrastructure can be 
used to (1) translate sequential programs to OpenMP parallel form and (2) generate 
code from OpenMP parallel programs. In scenario two, the code is translated into a 
microtask form, making calls to our MOERAE runtime library. 



3 Generating OpenMP from Sequential Program 



The Polaris compiler infrastructure is able to detect loop parallelism in sequential 
Fortran programs. To express this parallelism in OpenMP, we have added an 
OpenMP postpass. It can read the Polaris internal program representation and 
emit Fortran code annotated with OpenMP directives. 

The implementation of this postpass is a relatively straightforward mapping 
of the internal variables in Polaris to their corresponding directives |5|. Only 
in the case of array reductions and in privatizing dynamically allocated arrays, 
more involved language-specific restructuring is done. All other major code re- 
structuring is already implemented in Polaris in a language-independent fashion. 

Minor issues arose in that the parallel model used by OpenMP requires some 
changes to the internal structure of Polaris. Polaris, being tightly coupled to 
the loop-level parallelism model, lacks a clear method for dealing with parallel 
regions and parallel sections. Polaris-internally, program analysis information is 
typically attached to the statement it describes. For example, a loop that is 
found to be parallel has the DO statement annotated with a parallel assertion. 
The statement is likewise annotated with the variables that are to be classified as 
shared, private and reduction within the loop nest. In all loop-oriented parallel 
directive languages that Polaris can generate, this is an acceptable structure. 

In order to represent OpenMP parallel regions we have added loop preambles 
and postambles. Preambles and postambles are code sections at the beginning 
and end of a loop, respectively, that are executed once by each participating 
thread. It is now the statement labeled at the beginning of the preamble, which 
must contain the annotations for parallelism. This statement may no longer be 
the DO statement. The variables classified as reduction variables, still have to 
be associated with the DO loop itself. The specification does not require that 
reductions occur inside loops. You can reduce across parallel regions. These 
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changes in the structure of where information is to be stored, required a review 
of the Polaris passes. 

4 Compiling OpenMP Parallel Programs with PCOMP 

PCOMP (Portable Compiler for OpenMP) translates OpenMP parallel programs 
into thread-based programs. Two capabilities had to be provided to implement 
PCOMP. One is an extension of the parser for OpenMP directives, and the other 
is a translator that converts the Polaris-internal representation into the parallel 
execution model of the underlying machine. As a target we use a thread-based, 
microtasking model, as is common for loop parallel execution schemes. The Po- 
laris translation pass into thread-based forms has already been implemented in 
the form of the MOERAE system, described in 0. MOERAE includes a run- 
time library that implements the microtasking scheme on top of the portable 
POSIX-thread libraries. 

We added parser capabilities to recognize the OpenMP directives and repre- 
sent their semantics in the Polaris internal program form. The OpenMP direc- 
tives are translated in the following manner: 

4.1 Parallel Region Construct 

The PARALLEL and END PARALLEL directives enclose a parallel region, and define 
its scope. Code contained within this region will be executed in parallel on all 
of the participating processors. 

Figure 0 illustrates parallel region construct translation. PCOMP always re- 
places these directives with a parallel DO loop whose lower-bound is a constant 
1 and upper-bound is the number of participating threads. Figures 0 (a) and 
(b) show the code transformation from the source to PCOMP intermediate form 
for a parallel region construct. A variable mycpuid indicates the thread identi- 
fier number and cpuvar denotes the number of participating threads. The newly 
inserted loop is asserted as PARALLEL, and it allows the MOERAE system to 
generate the thread-based code shown in Figure 0(c). At the beginning of the 
program, the function initializeJ,hread is inserted to initialize the execution en- 
vironment, such as setting the available number of threads and creating threads. 
After this initialization, the created threads are in spin-waiting status until func- 
tion scheduling is called. 

The function scheduling invokes the runtime library to manage the threads 
and copies shared arguments (parameters in Figure 0 into the child threads. 
Our runtime library provides different variants of this function for the different 
scheduling options. Currently, only static scheduling is supported. 

There are several directive clauses, which may be included on the same line 
as the PARALLEL directive. They control attributes of the region. The directive 
clauses are translated as follows. 

The lY {expression) clause will cause the parallel region to be executed on a 
single thread, if the expression evaluates to FALSE. This directive will cause a 
two-version loop to be generated; one loop is serial and the other is parallel. 
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!$OMP PARALLEL 




!$OMP END PARALLEL 



(a) OpenMP parallel program 




SUBROUTINE _subr_name(parameters) 






END 

PROGRAM main 

CALL lnltlallze_thread() 
CALL schedulIngOsubr 


_name, parameters,...) 



(b) PCOMP Internal representation (c) PCOMP thread-based code 



Fig. 2. OpenMP PARALLEL construct transformation. The parallel region is sur- 
rounded by a loop whose lower-bound is a constant 1 and upper-bound is the number 
of participating threads. 



The SEAREDivariableJist) clause includes variables that are to be shared 
among the participating threads. 

The PRlVkTE(variableJist) clause includes variables that are local to each 
thread, and for which local instances must exist on each participating thread. 

The LASTPRIVATE(wana&Ze_&t) clause is similar to PRIVATE directive. The 
difference is that when the LASTPRIVATE clause appears on a DO directive, the 
thread that executes the sequentially last iteration updates the version of the 
object it had before the construct. PCOMP uses two- version statements to con- 
ditionally perform the updates to the variables if the iteration is the last one. 

The REDUCTI0N(?;arjaWe_ZzsO directive clause contains those scalar variables 
that are involved in a scalar reduction operation within the parallel loop. The Po- 
laris program representation already contains fields with this semantics. PCOMP 
transforms the OpenMP clauses into these fields. The parallel reduction trans- 
formation is then performed using the techniques described in |E| . Preamble and 
postamble code is generated to initialize a new, private reduction variable and 
to update the global reduction variable, respectively. The update is performed 
in a critical section using a lock/unlock pair. 

4.2 Work-Sharing Construct 

The DO and END DO directives enclose a DO loop inside a region. The iteration 
space of the enclosed loop is divided among the threads. As an example, the 
translation process of the DO directive with a REDUCTION clause is described in 
Figure 0 The PCOMP OpenMP parser reads Work-Sharing directives and as- 
sert them to the corresponding DO statement. In Figure 0(b), there are two DO 
statements. The first one is a parallel DO loop, which is created by the trans- 
lation of the PARALLEL construct. The directive clauses, such as PRIVATE and 
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(b) PCOMP internal representation (c) Thread-based program generated by PCOMP 



Fig. 3. OpenMP PARALLEL REDUCTION DO construct transformation 



REDUCTION are transformed to the assertion type information. These assertions 
give the PCOMP postpass information to generate thread-based code. The sec- 
ond DO statement is the original DO statement whose loop bound indices will be 
modified to share iteration space by the participating threads. 

Figure He) depicts how the iteration space is divided using loop bound modi- 
fication. The loop bounds are adjusted to reflect the chunk of iterations assigned 
to each thread. The if statement that compares myepuid, a thread number with 
epuvar, the number of threads detects the last iteration and adjust the upper 
bound of the loop to come up with the case when the total number of iterations 
is not a multiple of the number of threads. PCOMP generates postamble code 
for REDUCTION clauses and encloses the postamble code with lock and unlock 
functions implemented in the MOERAE runtime libraries 0. 



Synchronization Constructs. Critical blocks are surrounded by CRITICAL 
and END CRITICAL directives. The code enclosed in a CRITICAL/END CRITICAL 
directive pair will be executed by only one thread at a time. We replace the 
directives with lock and unlock functions. The BARRIER directive synchronizes 
all the threads in a team. When encountered, each thread waits until all of 
the other threads in that team have reached this point. PCOMP replaces the 
BARRIER directive with a sync function, which is a runtime library function. 

The current implementation of PCOMP supports a subset of the OpenMP 
1.0 specification for Fortran. Among the unsupported constructs are parallel 
sections, flushing operation in critical section and dynamic scheduling for parallel 
loops. 
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5 Performance Results 

We used benchmark programs (WUPWISE, SWIM, MGRID, and APSI) from the 
SPEC CPU2000 the SPEComp2001 benchmark suite to evaluate the 

performance of our compiler. We generated executable code using the op- 
tions f95 -fast -stackvar -mt -nodepend -xvector=no -xtarget=ultra2 
-xcache=16/32/l :4096/64/l. The thread-based codes by PCOMP are linked 
with the MOERAE runtime libraries. For comparison, we also compiled each 
OpenMP code with the SUN Forte 6.1 OpenMP compiler. We ran the codes on 
a SUN Enterprise 4000 (Solaris 2.6) system 0 using re/ data sets. 

First, we parallelized the SPEC2000 applications using the Polaris paralleliz- 
ing compiler P, which generated the OpenMP codes using our OpenMP post- 
pass. We generated two executable codes: (1) using the SUN parallel compiler, 
and (2) using a sequential compiler to compile the translated code by PCOMP 
and link with the runtime library. Figure P shows the speedup of these codes 
relative to the serial execution time of the original codes. The figure shows that 
our compiler performs similarly to the commercial compiler. The super-linear 
speedup in SWIM is due to a loop interchange in the SHAL0W_D03500 loop by the 
parallelizer. In MGRID the performance of our compiler is better than Forte 6.1 
in all loops except in RESID_D0600, where our thread-based postpass generates 
inefficient code handling LASTPRIVATE variable attributes. Figure 0] shows that 
the Polaris-generated codes successfully exploit parallelism in two of the four 
benchmarks. 

We used an early version of the SPEComp2001 benchmarks for measuring 
the performance of our PCOMP compiler. Figure El shows the speedup of these 
codes relative to the one-processor execution time of the code generated by the 
SUN compiler. The figure shows that our compiler performs similarly to the 
commercial compiler. 



6 Conclusion 

We presented an OpenMP compiler system for SMP machines. The compilers 
are built using the Polaris infrastructure. The system includes two compilers 
for translating sequential programs into OpenMP and for compiling OpenMP 
programs in a portable manner (PCOMP), respectively. 

We showed that the performance of PCOMP is comparable to commercial 
OpenMP compilers. Our infrastructure is publicly available, enabling experi- 
mental research on OpenMP-related language and compiler issues. The Polaris 
infrastructure has already been widely used in parallelizing compiler research 
projects. The availability of an OpenMP compiler component now also supports 
this important, emerging standard in parallel programming languages. Currently, 
only Fortran?? is supported. Extensions for FortranQO and C are being devel- 
oped, providing a complete open-source compiler environment for OpenMP. 
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APSI MGRID SWIM WUPWISE 



□ Forte (P=1) 0 Forte (P=2) ■ Forte (P=4) O PCOMP (P=1) H PCOMP (P=2) ^ PCOMP (P=4) 

Fig. 4. Automatic generation of OpenMP programs. Speedup of benchmarks as 
executed on SUN Enterprise 4000 using our OpenMP postpass. The OpenMP codes 
are parallelized by the Polaris parallelizer with our OpenMP postpass. The codes are 
compiled by the SUN Forte compiler, and by our PCOMP Portable OpenMP translator 
with a sequential compiler. P represents the number of processors used. 

4 T 1 



3.5 




APSI MGRID SWIM WUPWISE 



□ Forte (P=1) 0Forte(P=2) ■ Forte (P=4) E PCOMP (P=1) 0 PCOMP (P=2) 0 PCOMP (P=4) 

Fig. 5. Compilation of OpenMP programs. Speedup of benchmarks as executed 
on SUN Enterprise 4000. The OpenMP Suite codes are compiled by the SUN Forte 
compiler and our PCOMP Portable OpenMP compiler. P represents the number of 
processors used. 
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Abstract. This paper describes an implementation and a preliminary 
evalnation of the Omni OpenMP compiler on a parallel computer Cenju- 
4. The Cenju-4 is a parallel computer which support hardware distributed 
shared memory (DSM) system. The shared address space is explicitly al- 
located at the initialization phase of the program. The Omni converts all 
global variable declarations into indirect references through the pointers, 
and generates code to allocate those variables in the shared address space 
at runtime. The OpenMP programs can execute on a distributed memory 
machine with hardware DSM by using the Omni. The preliminary re- 
sults using benchmark programs show that the performance of OpenMP 
programs didn’t scales. While its performance of OpenMP benchmark 
programs scales poorly, it enables users to execute the same program on 
both a shared memory machine and a distribute memory machine. 



1 Introduction 

OpenMP aims to provide portable compiler directives for the shared memory 
programming environment. The OpenMP directives specify parallel actions ex- 
plicitly rather than as hints for parallelization. The OpenMP language specifi- 
cation, which is a collection of compiler directives, library routines, and environ- 
ment variables, came out in 1997 for the Fortran, and in 1998 for the c/c++in. 
While high performance computing programs, especially for scientific computing, 
are often written in Fortran as the programming language, many programs are 
written in C in workstation environments. Recently, compiler vendors for PCs 
and workstations have endorsed the OpenMP API and have released commercial 
compilers that are able to compile an OpenMP parallel program. 

We developed the Omni OpenMP compiler for a shared memory system |2] 
and for a page-based software distributed shared memory (DSM) system on a 
cluster of PCs 0. This paper describes the implementation of the Omni OpenMP 
compiler on a hardware support DSM system, and examine the performance 
improvement gained by using the OpenMP programming model. 

* NEC Corporation, k-kusano@cq.jp.nec.com 

** University of Tsukuba, msato@is.tsukuba.ac.jp 

R. Eigenmann and M.J. Voss (Eds.): WOMPAT 2001, LNCS 2104, pp. 20-^21 2001. 
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There are many projects related to OpenMP, for example, research to ex- 
ecute an OpenMP program on top of the DSM environment on a network 
of workstations Pni 13) E^nd the investigation of a parallel programming model 
based on the MPI and the OpenMP to utilize the memory hierarchy of an SMP 
cluster^. Our project also try to develop the OpenMP environment on an SMP 
cluster, based on the Omni0. 

The remainder of this paper is organized as follows: Section 2 presents the 
overview of the Omni OpenMP compiler. The hardware support DSM system 
of the Cenju-4 and the implementation details are presented in section 3. The 
experimental results and discussion are shown in section 4. The section 5 presents 
the concluding remarks. 

2 The Omni OpenMP Compiler 

This section presents the overview of the Omni OpenMP compiler Pj 0- 

The Omni OpenMP compiler is an open source OpenMP compiler for an 
SMP machine. Q. It is a translator which takes an OpenMP program as input 
and generates a multi-thread C program with run-time library calls. It consists of 
three parts: the front-ends, the Omni Exc Java toolkit and the run-time library. 
The front-ends for C and FORTRAN?? are available now. The input programs 
are translated into an intermediate code, called an Xobject code, by the front- 
end. 

The Exc Java toolkit is a Java class library that provides classes and methods 
to analyze and transform the Xobject code. It also has function to unparse the 
Xobject code into a C program. The representation of the Xobject code which 
is manipulated by the Exc Java toolkit is a kind of Abstract Syntax Tree (AST) 
with data type information. 

The generated C programs are compiled by a native C compiler, and linked 
with the Omni run-time library to execute in parallel. It uses the POSIX thread 
library for parallel execution, and this makes it easy to port to other platforms. 
Platforms that it has already been ported to are the Solaris on the Sparc and 
on the Intel, the Linux on the Intel, the IRIX on the SGI, and the AIX on the 
Power. 



2.1 Shmem Memory Model 

We chose a shmem memory model to port the Omni to a software DSM system 0. 
In this model, all global variables are allocated in the shared address space at run 
time, and those variables are referred through the pointers. The Omni converts 
all global variable declarations into indirect references through the pointers, 
and generates code to allocate those variables at run time. This conversion is 
shown in figure E and figure |2J Declaration and reference of global variables in 



^ The Omni OpenMP compiler for a SMP platform is freely available at 
http : //pdplab . trc . rwcp . or . jp/Omni/. 
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float x; /* global variable */ 
float fx[10]; /* global array */ 

fx[2] = x; 



Fig. 1. Original code 

float * G_x; /* pointer to x */ 

float * G_fx; /* pointer to fx */ 

*(__G_fx + 2) = (*__G_x) ; 

static G_DATA_INIT(){ /* initialize function */ 

_shm_data_init(&( G_x) ,4,0) ; 

_shm_data_init(&( G_fx) ,40,0) ; 

} 



Fig. 2. Converted code 



the original code, a variable x and an array fx, are converted to pointers. The 
allocation of these variables are done in the compiler generated initialization 
function, ’__G_DATAJNIT()’ in figure 13 This function is generated for each 
compilation unit, and called at the beginning of execution. Since the allocation 
of the DSM is also done by the library call in the program, the shmem memory 
model can be applied to the hardware support DSM system of the Cenju-4. 



2.2 Runtime Library 

The runtime library of the Omni on the Cenju-4 is basically the same as the 
shared memory version, except the machine dependent functions. The barrier 
is such a function and it simply calls the library function of the Cenju-4. The 
process of parallel construct is basically the same as the software DSM version 
Omni. At the start of the ’parallel’, pointers to shared variables are copied into 
the shared address space and passed to slaves, like the software DSM version. The 
slave processes are waiting to start parallel execution until the trigger from the 
library occurs. The allocation and deallocation of these threads are managed 
by using a free list in the runtime library. The list operations are executed 
exclusively using the system lock function. Nested parallelism is not supported. 



3 NEC Cenju-4 

The overview of NEC Cenju-4 which is the platform we used is presented in this 
section. 
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3.1 Overview 

A Cenju- 400 is a distributed memory parallel computer designed and manu- 
factured by the NEC Corporation. Figure 01 shows the overview of the Cenju-4. 




Fig. 3. The Cenju-4 overview 



The Cenju-4 is a non-uniform memory access (NUMA) multi-processor. The 
multistage network of the Cenju-4 connects up to 1024 nodes by using 4 x 
4 crossbar switches. The network has the following features: in-order message 
delivery between any two nodes, multi-cast and gather functions, and a deadlock 
free mechanism. Each node of the Cenju-4 consists of an RIOOOO processor, IM 
bytes secondary cache, a main memory up to 512M bytes and a controller chip. 
The controller chip accepts a user level message passing and a DSM access. It is 
not allowed to use both the message passing and the DSM access on the same 
page. This attribute is controlled by the operating system that is based on a 
MACH micro-kernel per page bases. 

3.2 Distributed Shared Memory 

The DSM of the Cenju-4 is implemented with the use of coherent caches and a 
directory-based coherence protocol. The DSM has the following four character- 
istics: 

— A directory dynamically switches its representation from a pointer structure 
to a bit-pattern structure, according to the number of nodes. This scheme re- 
quires the constant memory area regardless of the number of processors, and 
achieves an efficient record of nodes. It achieves scalability in both hardware 
and performance. 

— The Cenju-4 DSM use multicast and gather functions of a network to deliver 
requests and collect replies. It reduces the overhead of cache invalidation 
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message. The Cenju-4 also adopts a directory which can specify all nodes 
caching a block with one memory access. 

— A cache coherence protocol that prevents starvation. The Cenju-4 adopts a 
blocking protocol for cache coherence: requests which cannot be processed 
immediately are queued in the main memory for later processing. The buffer 
size of this queue is 32K bytes for 1024 nodes. 

— A deadlock-free mechanism with one network. The mechanism that queues 
certain types of messages for cache coherence in the main memory is pro- 
vided. The buffer size required for queuing messages is 128K bytes for 1024 
nodes. This is allocated in different area from the previous one. This buffer 
and previous one for starvation are allocated and used in different functions. 
The cache coherence protocol and the deadlock-free mechanism guarantees 
shared memory accesses will finish in finite time. 

Users have to insert library calls to use the DSM function, since the compiler 
that can generate codes to utilize the DSM function is not available. The shared 
address space is allocated by using a library call, and shared variables are al- 
located or re-allocated in that space. Hereafter, the allocated shared variables 
can be accessed the same as the private data. These process is almost the same 
as the Omni on the software DSM system, and the shmem memory model is 
efficient for the Cenju-4. 

There are some restrictions to use the Cenju-4 DSM: first, the maximum size 
of the shared address space is limited to the physical memory size. Next, the 
size of the shared address space is limited to 2 Giga bytes. This is because of the 
MIPS processor architecture that limits user address space up to 2 Giga bytes. 

3.3 Execution Model on Cenju-4 

The OpenMP uses a fork-join model of a parallel execution. In this model, slave 
threads are created when the parallel execution is started. A master thread has 
responsible for the allocation of shared address space and shared variables and 
setting of parallel execution environment. 

The Omni creates slave threads once in a program just after initialization 
of the master thread. These threads are managed by the run-time library, and 
remains until the end of a program. The Omni on the software DSM uses this 
model, but it is implemented on a process on each processor instead of a thread. 

The execution model of the Cenju-4 is a MIMD style execution model, like an 
MPI program. A user program is distributed to all processors by the command 
’cjsh’, almost the same as ’mpirun’ for MPI program. The Omni initialization 
process has to adapt for this execution model. 

An entry function and initialization process is modified as follows: first, all 
process execute the same entry function, and checks its PE number by using 
library call. Then, only a master process on the PEG starts execution of the 
initialization function, and other processes are waiting to start execution. After 
the master process ends up the initialization, other slave processes are start to 
execute the initialization function. Those steps make slave processes possible to 
share the master process data. 
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3.4 Shared Address Space 

All processes have to execute library ’dsm_open’ with the same DSM size as its 
argument to allocate the shared address space. This library call allocates shared 
memory for DSM for each processor and enables DSM access. The DSM function 
is enabled or disabled per page bases by the operating system. The address of 
shared space can be specified by the user, though the library allocates requested 
size automatically and returns its address. The size of shared area is limited to 
2GB on Cenju-4 DSM. 

After the allocation of the shared address space, shared variables are re- 
allocated in that space explicitly in the program by using ’dsm_malloc’ library. 
This re-allocation is also done by the master thread while its initialization pro- 
cess. This library allocates memory in a block cyclic manner with 128 bytes, 
the same size as the DSM management block. Each block is dedicated for one 
variable, even its size is smaller than the 128 bytes. 

4 Preliminary Results 

This section presents the performance of OpenMP programs on the Cenju-4 
DSM. 



4.1 Benchmarks and Evalnation Environment 

The Cenju-4 with 16 processors is used in the following experiment. Each node 
has a MIPS RIOOOO processor with IM bytes secondary cache and 512M bytes 
main memory. The multistage network has two stages for 16 PE. The operating 
system is a native OS of the Cenju-4. The C compiler is based on GNU C 
compiler and is a cross compiler on a front-end machine, NEC EWS4800. The 
optimize option ’-03 -funroll-loops’ is specified for all experiment. The Cenju-4 
system library ’cjsh’ is used to invoke a parallel program. 

We chose the microbenchmark and the C version NAS Parallel Benchmarks 
(NPB)|7] which is widely used parallel benchmark programs as our workload. 
The NPB programs are originally written in Fortran, we rewrote those into C 
code, and parallelized those using OpenMP directives. The results of CG, FT, 
IS, and LU are shown in the followings. 



4.2 OpenMP Overhead 

The microbenchmark |B|, developed at the University of Edinburgh, is intended to 
measure synchronization and loop scheduling overhead of the OpenMP runtime 
library. The benchmark measures the performance overhead incurred by the 
OpenMP directives. 

We measured the overhead of OpenMP directive by using the microbench- 
mark. Figure El shows the results of ‘parallel’, ‘for’, ‘parallel-for’, ‘barrier’, and 
‘parallel-reduction’, using the Omni on the Cenju-4 DSM. 
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Fig. 4. Omni overhead(usec) 



This result shows the overhead is increased in almost linear with the number 
of processors. The ‘barrier’ is almost five times slower than the SUN at 4 PE|5|. 
The overhead of ‘parallel’ at 4 PE is almost twice as long as the one of the Omni 
on the SUN 4 PE. 



4.3 NPB Parallelization 

The NPB programs are parallelized by using OpenMP directives. We use ’or- 
phan’ directives to minimize the number of parallel construct and reduces the 
overhead of parallelization. The code fragment of the parallelized NPB main 
loop is shown in figure 0 



main Of 

#pragma omp parallel private (it .... ) 
for(it=l; it<=NITER; it++)f 
<function call, etc> 

> 

} 



Fig. 5. Parallelization of the NPB 



The OpenMP ‘for’ directives are inserted for the parallelizable loops. The 
scheduling policy for the parallel loops is not specified. It means that the default 
scheduling of the Omni, static block scheduling, is used. The ‘single’, ‘master’ 
and ‘critical’ directives are also inserted appropriately. The code fragment of 
main calculation loop of CG is shown in figure El 
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conj_grad( ...)■[ 

for(cgit=. . . )■[ ... 

#pragma omp for private(k, . . .) 
for(j=0; j<. . .){ 
f or (k= . . . ) { 



} 



y 

y 



Fig. 6. Parallelization of the CG 



4.4 NPB Performance Results 

The execution time of the NPB programs are shown in the tabled] The ’S’ in 
the table indicates sequential execution. The size of each benchmark program is 
class A for CG and IS, and class W for FT and LU. The last two programs are 
class W because of the memory shortage. 



Table 1. NPB Execution time(sec) 



PE 


cg(A) 


is(A) 


ft(W) 


lu(W) 


S 


62.06 


47.83 


8.19 


386.71 


1 


73.22 


49.90 


9.26 


396.73 


2 


60.14 


27.15 


6.22 


241.09 


4 


38.28 


16.14 


3.92 


125.12 


8 


22.42 


12.85 


2.61 


68.04 


16 


11.74 


15.28 


1.75 


37.78 



The OpenMP version that is executed on 1 PE is slower than the sequential 
version from 4% to 14%. This is considered due to the overhead of library calls 
and inserted codes for parallelization. The performance of CG at 2 PE scales 
poorly compared to the others. 

The speedup that is normalized by sequential execution time is shown in 
figure 0 The program LU achieves good performance improvement, 10 times 
speedup at 16 PE. Though the performance of the other programs are improved, 
the speedup is ranging from 3.1 to 5.3 times at 16 PE. Moreover, the performance 
of IS is decreased at 16 PE. 

4.5 Discussion 

The microbenchmark results show that the management policy of the slaves is 
not appropriate for the Cenju-4. The overhead of the allocation and the deallo- 
cation from the list scales poorly. 
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Fig. 7. Speedup of NPB 



The results of the NPB benchmark are not so good as the one shown in the 
previous evaluationjS] that the performance scales well until 32 PE. 

The followings are considered as the cause of the difference: 

— Parallelization 

We parallelize the NPB programs by using the OpenMP directives. The 
parallelization in the previous evaluation is almost the same way as the one 
of the MPI version. 

— The number of shared variables 

The OpenMP attribute of variables, shared or private, are not specified in 
this experiment. The OpenMP compiler detects global variables and allo- 
cated those in the shared address space. On the other hand, the shared 
variables in the previous experiment are optimized by the user, because it is 
based on the MPI version. This results the required shared memory area of 
the OpenMP version is bigger than the one of the previous evaluation. 

— The memory alignment of the variables 

The shared memory allocation code is generated automatically in the 
OpenMP version. The allocation of the global data is done by the Cenju- 
4 library, and it is a cyclic distribution with 128 bytes width. This is the 
default data distribution of the Cenju-4 DSM library that we used. This 
allocation has memory fragmentation, because each DSM line is dedicated 
to only one global variable or array. The alignment of data is depend on the 
generated sequence of allocation, while the alignment is carefully considered 
in the previous experiment. This data alignment is expected to affect the 
result. 

— The language of the program and the compiler 

The C compiler which is based on the GNU C compiler is used for our 
experiment. On the other hand, the previous evaluation uses Fortran version 
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NPB2.3 and the Fortran compiler. This is considered as the main reason of 
the execution time difference of sequential version. 

Although there are undesirable characteristics described above, the paral- 
lelization using the OpenMP has a merit. The same programs on the SMP ma- 
chine can be executed on the DSM system by using our OpenMP compiler. This 
reduces the program parallelization process of the user in various ways. 

5 Concluding Remarks 

This paper presented the implementation and the preliminary evaluation of the 
Omni OpenMP compiler on the Cenju-4. The NEC Cenju-4 is a distributed mem- 
ory parallel computer and has hardware support DSM function. The OpenMP 
programs are analyzed and translated to codes that global variables are allocated 
on the shared address space. The allocation is done in the compiler generated 
function that is executed in the initialization process. The Omni provides trans- 
parent OpenMP environment on the distribute memory system that has hard- 
ware support DSM function. This enables users to reduce their task to porting 
programs. Furthermore, a sequential program and a parallel program are main- 
tained in the same source code. 

The result shows the runtime library of the Omni has to be improved to re- 
duce the overhead of parallel construct. The performance of other library func- 
tions also has to be checked and to be improved the performance. Our evaluation 
using the NPB benchmark programs shows that the OpenMP programs achieve 
speedup on the distributed memory system Cenju-4, though the speedup is not 
scales. 

The Omni now supports data mapping directives for the software DSM sys- 
tem, which enables to schedule the same array element is executed on the same 
PE. The research of this function on the software DSM system indicated that 
the data mapping is important to improve performance of the program. The 
impact of the memory alignment to the application program is one of the main 
subject of the future work. 
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Abstract. The purpose of this benchmark is to test the existence of cer- 
tain optimization techniques in current OpenMP compilers. Examples 
are the removal of redundant synchronization constructs and effective 
constructs for alternative code. The effectiveness of the compiler gener- 
ated code is measured by comparing different OpenMP constructs and 
compilers. If possible, we also compare with the hand coded “equivalent” 
solution. 



1 Introduction 

Implementations of OpenMP are available on almost every shared memory plat- 
form. The portability of applications with OpenMP directives is therefore one 
of the strong points of this standard. Others are the advantages of shared mem- 
ory programming in general: it allows a incremental approach to a fully parallel 
program, and it is also possible to switch between serial and parallel execution 
during runtime. This enables the programmer to avoid the unavoidable overhead 
introduced by the parallelism whenever a serial execution is faster. 

Since the goal of parallel programming is to achieve higher performance, 
the further acceptance of OpenMP will strongly depend on compiler optimiza- 
tion techniques, especially in the field where OpenMP has its possible benefits 
as described above. The importance of benchmarks to measure performance is 
reflected by many application benchmarks (e.g. PCI) and also the well known 
OpenMP Microbenchmarks P^- To evaluate the performance of compilers a com- 
bination of both has to be used |3j. The focus of this benchmark is somewhat 
different. On one hand it tries to avoid the architectural dependency of a di- 
rect measurement of synchronization and scheduling times, on the other hand it 
measures isolated OpenMP related optimizations directly without introducing 
the complex behavior of a complete application. To test whether the proposed 
optimization techniques are already active in current compilers and to judge the 
efficiency of the proposed manual solutions several compilers have been used. 
The compilers from PGI0, Hitachi and SGI are compilers producing native 
code, whereas the OmnijJ and guide 0 compiler are front ends to native com- 
pilers. 
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2 Benchmarks 

The benchmark judges the existence of various optimization techniques by com- 
paring different OpenMP constructs. If a hand coded solutions exists for a con- 
struct its performance is also provided. 

With one exception all constructs use the same work load: 

for(n=0; n<count; n++){ 
f or (i=0; Klength; i++){ 
a[i] = b[i]+c[i] ; 

} 

/* the following is intended to fake the optimizer*/ 
b[n]+=offset*a[n] ; 
c [n] +=of f set*a [n] ; 

} 

The fields a b and c are double precision arrays. The outer loop count is ad- 
justed to guarantee a minimum runtime with a minimum execution count of ten 
repetitions. Some magic tries to avoid that the optimizer removes the outer loop. 
Here a zero is added to some array elements. The value of offset is passed in via 
a command line argument. The inner loop count length is changed to measure 
the size dependent performance. This is not only important for the constructs 
that allow alternative code generation for low loop counts, but also provides use- 
ful informations for other situations. It shows where the parallelization efforts 
pays of, and whether or not there is a memory bottleneck. Since only relative 
performance is important the performance is given in millions loop iterations 
per second in this paper. 



2.1 Overhead of Thread Creation 

Whether or not it is beneficial to execute a part of a program in parallel depends 
on the overhead of thread creation. More precisely it is not the overhead to 
create the threads for one parallel section, but how the implementation tries to 
minimize the necessary time to switch between parallel and serial execution in 
different places. An optimized implementation may decide to keep the threads 
alive at the end of one parallel section in order to recycle them at the beginning 
of the next parallel section. The benchmark consists of the standard working 
loop that sums the two vectors. Three versions of the code are compared against 
each other: 

parallel for: a parallel for construct inside the outer loop. Here, the threads 
are recreated for every iteration of the outer loop, or at least they have to 
be recycled. 

for: the parallel region is created outside the inner loop. The working loop is 
parallelized with a for construct. Here, the threads are created only once. 
This should result in the best performance. 
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for with barrier: because the parallel for also contains an implicit barrier 
at the end of the construct, the third version contains an explicit barrier in 
addition to the for directive. This checks whether the additional overhead 
is mainly created by the barrier rather than the thread creation. 



Table 1. Overhead of parallel for construct, compared to for construct with or 
without barrier. 



Compiler 


Overhead 
parallel for 

[ms] 


Overhead 
for with barrier 

[ms] 


Overhead 

for 

[ms] 


PC PGI 


3.9 


4.6 


2.8 


PC Omni 


6.2 


3.3 


2.4 


PC guide 


4.7 


3.4 


2.3 


SGI native 


11 


11 


10 


SR8K native 


2.3 


3.7 


2.5 



The overhead of the PGI compiler is dominated by the implicit barrier at the 
end of the parallel region. The recycling of threads seems to be very efficient. 
The thread creation on the Hitachi SR8000 is very fast, there seems to be a 
very efficient solution for the parallel for construct. It is even faster than the 
for construct inside a parallel region. All other compilers show the expected 
behavior. 

2.2 Alternative Code 

Once the overhead to create threads is known, the programmer can decide to 
avoid the overhead of parallel execution for small work load. For this purpose 
OpenMP contains the if clause. A parallel region with the if clause is only 
executed in parallel if the condition is true, otherwise the code is serialized. The 
goal is to have the performance of the serial code if it is faster than the parallel. 
To decide whether this goal is achieved the performance of the serial and parallel 
version is supplied. In addition a manual solution is implemented as follows: 

if (condition) { 

#pragma omp parallel 

/* code */ 

} 

> 

else { 

/* code */ 

} 

In Fig. n it is clearly visible that for small loop length the serial code is much 
faster than the parallel version. A surprise is, that only the native SGI and Hi- 
tachi compilers produce code that is as fast as the manual solution. All other 
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compilers are very similar to the Omni OpenMP compiler. Here, the alternative 
code is better than the simple parallel loop, but inferior to the manual implemen- 
tation. It should be noted, that the manual implementation requires not more 
than a simple text replacement. This could easily be implemented by compilers 
that are front ends to native back end compilers. 

2.3 Orphaned Directives 

An orphaned directive is a OpenMP directive that does not appear in the lex- 
ical extent of a parallel construct, but lies in the dynamic extent. If such a 
work-sharing construct is not enclosed dynamically within a parallel region the 
OpenMP standard states “it is treated as though the thread that it encounters 
it were a team of size one” . This allows the user to write code that is used in a 
parallel and serial context. The question is, whether there will be a overhead if 
the code is called from a serial context. 

Several versions are compared against each other: 

scalar: Performance of scalar code. 

scalar with function call: Performance of scalar code with function call. The 
called function contains the working loop. This is done, because many com- 
piler put parallel regions inside functions. This version checks whether a 
potential performance reduction is caused by the introduced additional call, 
orphaned parallel: The working loop contains orphaned OpenMP directives. 

The loop is called from a parallel region with a team of one thread, 
orphaned serial: The working loop contains orphaned OpenMP directives. 
The loop is called from a serial region. 

orphaned manual: Finally, the hand coded version described below of the 
function is used. 

A straight forward manual implementations looks like this: 

if (omp_in_parallel 0 ) { 

#pragma omp for private (i) 
f or (i=0; i<length; i++){ 
a[i] = b[i]+c[i] ; 

} 

} 

else { 

f or (i=0; i<length; i++){ 
a[i] = b[i]+c[i] ; 

} 

} 



With the PGI and Hitachi compilers it makes no difference whether the 
working function is called from a serial region or from a parallel region with a 
team of one thread. Normally the performance within a parallel region is sig- 
nificantly worse. With the exception of the Hitachi compiler, the manual solu- 
tion always results in the best performance, although there is an additional call 




scalar code 
scalar code with function call 
orphaned called from parallel region 
orphaned cay^ed from serial region 
manual solution 



le+01 le+02 le+03 le+04 le+05 



SGI native 



scalar code — 
scalar code with function call -x- 
orphaned called from parallel region -* 
orphaned called from serial region & 
manual solution 




le+01 le+02 le+03 le+04 le+05 



erformance of orphaned directives. Result of the PGI compiler is 
comniler nroduce results similar to the SGI comniler on the right. 
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to omp_in_parallel 0 . The overhead of orphaned directives is always huge. 
In principle the compiler could generate code without this overhead. For any 
function foo with orphaned directives, it could generate an additional function 
foo_scalar containing serial code and a function foo_parallel with parallel 
code. Depending whether the function foo is called inside or outside a parallel 
region the corresponding function call is substituted. The function foo with a 
solution equal to the manual solution could be provided to maintain link com- 
patibility. 

2.4 Removal of Redundant Synchronization 

Three tiny examples check whether the OpenMP compiler removes redundant 
synchronization. 

Parallel Region Merge. This is simply a concatenation of two parallel 
for directives. An optimizing compiler should merge the two parallel regions. 
All compilers with the exception of the Omni and Hitachi compilers seem to 
implement this optimization. At least there is no visible performance difference 
between the two versions. 

Implicit nowait at End of Parallel Region. The third is a for construct 
directly embedded in a parallel region. It is checked whether there is a dif- 
ference between this version, a direct parallel for and a for construct with 
a nowait clause. An optimizing compiler should produce the same code for all 
three versions. Since there is no code between the for loop and the end of the 
parallel region, there is no need for more than one barrier. The Guide Com- 
piler is the only one that shows the desired behavior. The results of the for loop 
with nowait clause show that the PGI and Omni compiler don’t remove the 
redundant barrier. 

Implicit nowait Due to Independent Blocks. It consists of two for con- 
structs, that work on independent arrays. If the compiler detects that the two 
basic blocks are independent of each other, it may remove the barrier between 
the two loops. To increase the difference between a version with and without 
barrier a load imbalance is introduced in the first loop, that is balanced by a 
load imbalance in the second loop: 

f or (i=0; i<length; i++){ 

/* additional work for first half */ 
if ( i < length/2) 

a[i] = b [i] +c [i] toff set*sin(cos (b [i] ) ) ; 
else 

a[i] = b [i] +c [i] ; 

} 

/* same loop on different arrays, additional work on second 

half */ 
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The removal of this barrier requires a detailed code analysis. It is no surprise 
that the front end compilers (Omni and Guide) do not implement this. Also no 
other compiler performs this optimization. 



2.5 Benefit of OpenMP Directives for Sequential Code 

This is maybe the most interesting test. The idea behind it is, that OpenMP 
directives may help the compiler to generate better code because he knows that 
certain preconditions are fulfilled. As an example workload we use the loop 

for(i=0; i<size; i++){ 

a [index [i]] = a [index [i]] + b[i]; 

> 

Because the compiler does not know, how index [i] looks like, he cannot assume 
that the different iterations of the loop are independent. The situation is differ- 
ent, if there is an #pragma omp parallel for. If the loop can be executed in 
parallel it is also subject to optimization techniques like software pipelining or 
vectorization. For large loop counts the performance of the parallel version exe- 
cuted on one thread should therefore exceed the performance of the serial code. 
During the tests there was only one case (native compiler on Hitachi SR8000) 



SR8K native 




Fig. 3. OpenMP directives may help to optimize serial code. Both versions run with 
one thread. The OpenMP directive allows to vectorize the code. 
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where the performance of the loop with OpenMP directives was increased (see 
Fig. ED. However, a possible further cause might be, that it is impossible to 
increase the performance of a loop with indirect addressing on the tested archi- 
tecture. 



Table 2. Summary of optimization techniques of different compilers. 



Compiler 

Version 


PGl 

3.2-3 


Guide 

3.9 


Omni 
1. 2s 


SGI 

7.3.1.1m 


Hitachi 

? 


Optimization 

Alternative code better or equal to manual solution 


no 


no 


no 


yes 


yes 


Optimal Orphaned Directives 


no 


no 


no 


no 


no 


Orphaned Directives better or equal to manual solution 


no 


no 


no 


no 


yes 


Parallel Region Merge 


yes 


yes 


no 


yes 


no 


implicit NOWAIT due to independent blocks 


no 


no 


no 


no 


no 


implicit NOWAIT due to end of region 


no 


yes 


no 


no 


no 


OpenMP directives used as optimization hint 


no 


no 


no 


no 


yes 



3 Conclusion 

This small benchmark contains a collection of various optimization techniques 
that might be implemented in OpenMP compilers. The focus was to avoid archi- 
tecture dependent techniques on one hand and to concentrate on features that 
are crucial to achieve maximum performance, especially in areas where the goal 
is to avoid the parallel overhead whenever a scalar execution is faster. 

Five from the seven proposed optimization techniques are already imple- 
mented in different compilers, this shows that it is a realistic demand to ask 
for the implementation of these techniques. Currently most compilers offer only 
two optimizations, the Hitachi compiler implements three. This demonstrates 
the possible improvements for the current compilers. 
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Abstract. As cluster computing has grown, so has its use for large sci- 
entihc calculations. Recently, many researchers have experimented with 
using MPI between nodes of a clustered machine and OpenMP within a 
node, to manage the use of parallel processing. Unfortunately, very few 
tools are available for doing an integrated analysis of an MPI/OpenMP 
program. KAI Software, Pallas GmbH and the US Department of Energy 
have partnered together to build such a tool, VGV. VGV is designed for 
doing scalable performance analysis - that is, to make the performance 
analysis process qualitatively the same for small cluster machines as it 
is for the largest ASCI systems. This paper describes VGV and gives a 
havor of how to hnd performance problems using it. 



1 Introduction 

Cluster computing has emerged as a defacto standard in parallel computing 
over the last decade. Now, researchers have begun to use clustered, shared- 
memory multiprocessors (SMPs) to attack some of the largest and most complex 
scientific calculations in the world today running them on the world’s 

largest machines including the US DOE ASCI platforms: Red, Blue Mountain, 
Blue Pacific, and White. 

MPI has been the predominant programming model for clusters H21; however, 
as users move to “wider” SMPs, the combination of MPI and threads has a 

* This work was performed under the auspices of the U.S. Dept, of Energy by Univer- 
sity of California LLNL under contract W-7405-Eng-48. 
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“natural fit” to the underlying system design: use MPI for managing parallelism 
between SMPs and threads for parallelism within one SMP. 

OpenMP is emerging as a leading contender for managing parallelism within 
an SMP. OpenMP and MPI offer their users very different characteristics. De- 
veloped for different memory models, they fill diametrically opposed needs for 
parallel programming. OpenMP was made for shared memory systems, while 
MPI was made for distributed memory systems. OpenMP was designed for ex- 
plicit parallelism and implicit data movement, while MPI was designed for ex- 
plicit data movement and implicit parallelism. This difference in focus gives the 
two parallel programming frameworks very different usage characteristics. But 
these complementary usage characteristics make the two frameworks perfect for 
handling the two different parallel environments presented by cluster computing: 
shared memory within a node and distributed memory between the nodes. 

Unfortunately, simply writing OpenMP and MPI code does not guarantee 
efficient use of the underlying cluster hardware. What is more, most existing 
tools only provide performance information about either MPI or OpenMP, but 
not both. This lack of integration in our performance tools prevents users from 
understanding the critical path for performance in their application. To do a 
good job of performance analysis for such codes, users need detailed information 
about the expense of operations in their application. Most likely, message pass- 
ing activity and OpenMP regions are related to the most expensive operations. 
Viewed in this light, the user needs a performance analyzer to understand the 
interactions between MPI and OpenMP. For pure message passing codes, sev- 
eral performance analysis tools exist: Vampir TimeScan |3j, Paragraph P33, 
and others. For pure OpenMP codes there is GuideView 0 and a few other pro- 
prietary tools from other vendors. For a combination of MPI and OpenMP, we 
know of only one other tool - Paraver 

To address the need for an integrated performance analysis tool, KAI Soft- 
ware and Pallas GmbH have partnered with the Department of Energy through 
an ASGI Pathforward contract to develop a tool called Vampir/GuideView, or 
VGV. This tool combines the capabilities of Vampir and GuideView into one 
tightly-integrated performance analysis tool. From the outset, its design targets 
performance analysis on systems with thousands of processors. 

The purpose of this paper is to describe this tool, how it may be used, and 
how it can help pin-point the source of performance problems in MPI/OpenMP 
programs. 



2 Related Work 

A number of existing tools provide performance analysis of message passing 
programs. XPVM |S| can be used to analyze the performance of PVM programs. 
It provides the user an instrumented messaging library and provides a graphical 
user interface (GUI) for visualizing the performance. The user need not insert 
instrumentation in their code because the instrumentation exists already in the 
instrumented library. Vampir CH and Paragraph cm are used for MPI programs. 
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These tools use the MPI profiling interface to capture all MPI calls, then merge 
the trace information into a single trace file. A visualization program later reads 
the trace file and draws a graphical representation of the messaging activity 
between processors. 

There are a number of tools for analyzing the performance of HPF codes, 
which offer a shared memory view to the user, but produce message passing 
code after having been compiled. The Carnival m and Parade jS] systems are 
examples of these tools. Carnival maintains links to the source code from the in- 
strumentation, so that the user can relate performance to the program, although 
it was implemented only on IBM systems, to our knowledge. The PARADE sys- 
tem uses by-hand instrumentation and does post-execution “trace animation” 
through the POLKA animation system. 

The Paradyn PJ tool does dynamic instrumentation of a running program 
by replacing existing instructions with branches to small sections of code called 
trampolines that allow the calling of various instrumentation functions. This 
provides a very low-overhead and flexible method of instrumentation, but the 
focus of that project is different from ours since they do not make extensive use 
of information about the program gathered by a compiler. 

The Ovaltine 0 project has developed a tool to analyze the overhead of 
OpenMP codes, as a way of comparing achieved and achievable performance for 
a particular code. This type of analysis is already present in the Guideview part 
of VGV. 

The Paraver 0 project, like our own, is focused on building a tool for ana- 
lyzing the performance of programs that integrate MPI and OpenMP. Paraver 
is based on a binary instrumenter, that can instrument the MPI functions in 
a program as well as the OpenMP support functions. Any instrumented region 
writes trace records to a trace file, which are then displayed through a GUI. The 
GUI has facilities for user-selected time-scales and zooming in on an arbitrary 
time range in any display window. The goals of the Paraver project are similar 
to those of the VGV project, although it is not clear how important scalability 
is for Paraver. Also, they rely on binary instrumentation, where VGV is based 
on compiler-inserted instrumentation. 



3 Goals of the Project 

The main goals of the VGV project are to create an integrated MPI/OpenMP 
performance analysis tool that is easy to use and that scales well to even the 
largest systems currently available. This new tool is largely based on the existing 
Vampir and GuideView tools. 

3.1 Scalability 

A performance analysis tool faces new problems when it is used for systems with 
thousands of processors. If the tool is not careful, the amount of information 
gathered about the performance of a program can become very large, filling 
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disks or causing large data transfer times. The amount of information displayed 
on-screen can overwhelm the user if it is not displayed appropriately, and on- 
screen display space is limited, anyway. The aggressive goal of the VGV project 
is to quadruple the number of processors that can be analyzed every year for the 
next two years. This year VGV can handle 1000 processors. 

3.2 Integration 

To perform effective performance analysis with VGV, there must be an inte- 
gration of information from Vampir and Guide View. This not only avoids the 
work of manually coordinating output from the two tools, but also provides a 
platform for synthesizing an overall performance report. The performance data 
of both tools should also be integrated with source code information. 

3.3 Effective Data Presentation 

VGV should present an interface which makes the experience of using it for 
solving performance problems on large machines not materially different from 
solving such problems on small systems. The tool should also be able to draw 
the user’s attention to potential performance problems, and help the user locate 
the source of those problems in the program. 

4 Using MPI with OpenMP 

Before describing how VGV intends to meet its goals, we will briefly mention 
some key issues that must be addressed when using MPI with OpenMP. MPI 
may be used with OpenMP, but the two systems have no knowledge of each 
other, so a few basic rules must be followed to ensure that they do not interfere 
with each other. 

In general, MPI implementations are not thread-safe, so MPI functions can 
not be safely used when more than one OpenMP thread is active. Therefore, 
calls to MPI functions should be done either outside OpenMP parallel regions, 
as shown in Figure D or inside a region in which only one thread is active, such 
as a MASTER region or a SINGLE region, as shown in Figure El 

In addition, if MPI calls are used in a single-threaded section of an OpenMP 
parallel region, OpenMP barriers on both sides of the single-threaded section are 
needed to enforce data consistency. This makes sure that the MPI call sees a con- 
sistent view of data, and that the following code section sees any modifications 
to data caused by the MPI call. 

Since message passing calls are hard-coded into the program, the messaging 
structure of the program can not easily adapt to changing patterns of computa- 
tion. Therefore, the messaging will typically be done to support a fixed structure 
in the code. OpenMP, on the other hand, can dynamically adjust the number of 
threads brought to bear on the various parallel loops within the code, so it can 
adjust to changes in the fine-grained structure of the computation. 
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CALL MPI_SEND(A(1) , N, MPI_REAL, 1, tag, comm, ierr) 

CALL MPIJRECV(A(1) , N, MPI_REAL, 0, tag, comm, status, ierr) 
! $omp parallel do shared(A, B, N) 

DO 1=1, N 

B(I) = F(A(D) 

END DO 



Fig. 1. Example of using MPI to exchange data outside an OpenMP parallel region. 

! $omp parallel shared(A, B, N) 

! $omp do 

DO 1=1, N 

BCD = F(A(D) 

END DO 

! $omp barrier ! to insure consistent memory 
! $omp master 

CALL MPI_ALLREDUCE(A, RA, N, MPI_REAL, MPI_SUM, 0, comm) 

! $omp end master 

! $omp barrier ! to insure consistent memory 
! $omp do 

! $omp end parallel 

Fig. 2. Example of using MPI to do a reduction operation inside an OpenMP parallel 
region. 



Typically, a fixed number of MPI processes are used, corresponding to the 
number of nodes being used for the computation. The number of processors 
within each node would represent the maximum number of processors that can 
be brought to bear on a parallel loop being run within a single MPI process. 
In adaptive codes, the amount of work in a particular area of a grid can vary 
widely, so the number of OpenMP processors used in that area might likewise 
vary. If there is only a small amount of work in a given parallel loop, then 
only a small number of processors need be used (less processors used means less 
synchronization overhead) . 

5 Structure of the Tool 

The flow of the integrated tool follows 4 steps: 

1. instrumenting the program at compile time, 

2. generating an integrated MPI/ OpenMP trace file at runtime, 

3. post-run performance analysis for MPI with Vampir, 

4. analyzing OpenMP performance with GuideView. 

This design integrates Vampirtrace and Vampir with the OpenMP compo- 
nents: Guide, the Guide Runtime Library, and GuideView. 
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Like most MPI performance analysis tools, Vampirtrace uses the MPI library 
wrapper interface for instrumentation. As each MPI call is performed, an event 
is written to a trace file. Vampir is the post-run trace file analysis tool. 

Guide is a portable OpenMP compiler for Fortran and C-|— I- that restructures 
source code and inserts calls to the Guide Runtime Library. The Guide Runtime 
Library layers on top of threads to implement OpenMP functions. The library 
is instrumented to call clock timers around all the significant OpenMP events. 
At the end of a run, the information gathered from these timers is written into 
a statistics file. 

The heart of the MPI and OpenMP integration occurs at runtime. The instru- 
mentation of OpenMP and MPI requires coordination. This is achieved by adding 
OpenMP events to the Vampirtrace API. The Guide Runtime Library is mod- 
ified to instrument interesting OpenMP events. For each interesting OpenMP 
event, the execution times are put into a data structure that is time-stamped 
and sent to the trace file. 

In the next phases of the project, dynamic instrumentation will become more 
important. Then, the user will be able to identify at run time which parts of the 
program should be instrumented and traced to get a closer and more focused 
view to performance bottlenecks. The dynamic instrumentation will combine 
with and complement the compiler-inserted instrumentation. 



6 Usage of the Tool 

Once an integrated MPI/OpenMP trace file has been created during the appli- 
cation run, it can be viewed by an integrated user interface. Vampir shows the 
trace file events ordered by time in the timeline display. When an MPI process 
executes an OpenMP region, a curvy line glyph appears at the top of that pro- 
cess’ time line. The user can select that glyph to view that OpenMP region, or 
can select a set of MPI processes or a time line section for OpenMP analysis. 

OpenMP analysis aggregates the OpenMP data structures from all the trace 
file events in the selection. Then the aggregated data is written to a file where a 
Guide View server process reads the file. 

GuideView displays the OpenMP regions for each MPI process as a separate 
set of OpenMP data. In this way, the user can use GuideView tools to select a 
subset of the hundreds of MPI processes that may be running and sort by any 
OpenMP performance measure. Examples of OpenMP performance measures 
for sorting are: scheduling imbalance, lock time, time spent in a locked region, 
and overhead. The user can specify that GuideView show the top or bottom n, 
where the user specifies n. This mechanism allows a user to compose compound 
performance queries by sorting on one criteria, filtering the top responders, and 
then sorting by another criteria. 

The user can also view the subroutine profile for one or a selection of MPI 
processes within GuideView. This can be viewed as inclusive to allow the user to 
understand the call tree structure, or exclusive to understand which subroutines 
consume the most time. 
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One of the important uses of the tool is to locate regions of the program 
where some processors spend much time waiting while others are doing useful 
work. This is referred to as a load-imbalance. From the color-coded display, the 
user can determine how much time each processor spends waiting. 

Besides the new analysis features for the OpenMP parts, the usual analy- 
sis features of Vampir can be used for the whole program including all parts. 
As a new feature, hardware performance monitor information is now available 
for further inspection. Current processor architectures usually offer performance 
monitor functionality, mostly in the form of special registers that contain various 
performance metrics like the number of floating-point operations, cache misses 
etc. The use of these registers is limited because there is no relation to the pro- 
gram structure: an application programmer typically does not know which parts 
of the application actually cause bad cache behavior. By extending the Vampir 
trace format, this data is now available inside the Vampir windows to provide 
identification mechanisms for functions with low performance properties. 

As the systems under investigation could have thousands of processors, the 
scalability requirement has introduced a couple of new hierarchical concepts for 
the Vampir windows. Especially, a flexible grouping concept has been developed 
to show data just related to the right level of abstraction: showing information 
for all processes, showing just accumulated data for the SMP nodes, showing 
just information for the master thread, etc. This feature enables end users to 
easily dive into the important program regions that have performance problems. 

A further key use of VGV is source code browsing. The source code associated 
with any part of the performance data may be brought up in a browsing window 
by clicking the mouse on the data display. 



7 Finding Performance Problems with VGV 

Figures!^ and ^ show the performance of the MPI/OpenMP version of the pro- 
gram SWEEP3D. In a hypothetical experiment, a user may have run this pro- 
gram on two MPI processes, with four OpenMP threads in each, and discovered 
that it exhibits very poor speedup. The user could then run VGV and begin 
with a whole-program view of the performance, such as the frame at the top 
of Figure 0 This view shows the execution activity for each MPI process as a 
horizontal bar. The messaging activity between the processes is shown as lines 
connecting the two process bars. The regions during which OpenMP activity 
exists is indicated as regions where the wiggle glyph appears at the top of the 
process bar. As can be seen from the Figure, nearly all of each process bar is 
covered by OpenMP parallel regions. So where is the problem? 

The problem may be investigated by adding OpenMP detail to the MPI 
activity. The middle frame in Figure 0 adds OpenMP thread activity to the 
MPI information. We can begin to advance a theory about the cause of the 
problem from this frame, because there seem to be a large number of small 
OpenMP execution regions, separated by large gaps. 
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Fig. 3. Performance display frames for the SWEEP3D program, run on two MPI pro- 
cesses with four OpenMP threads on each. The “cnrvy line glyph” line at the top of a 
bar indicates that an OpenMP parallel region is active. The top frame shows the over- 
all timeline of the two MPI processes. The middle frame shows the same information, 
with the OpenMP threads displayed as well. The bottom frame shows the OpenMP 
thread activity from a single parallel region. 




48 



J. Hoefiinger et al. 



By zooming in on a single OpenMP region (as in the bottom frame of Fig- 
ure 0 ), we see that the parallel execution time of each “helper” thread is sepa- 
rated by a gap at least as large as the execution time itself. 

This information seems to point to a large number of relatively small paral- 
lel regions, dominated by thread startup and shutdown times. To confirm that 
theory, we can look at aggregated thread information, as in Figure 0 In the 
top frame of that Figure, we see the aggregated whole-program information for 
all threads, and the speedup graph for the code. This information corroborates 
the view that a large fraction of the time spent in each process is “sequential” 
(the left-most region in each bar). This corresponds to the thread startup and 
shutdown times. The parallel execution time is indicated in each right-most bar 
region, and is a small fraction of the total time. The speedup graph shows that 
the potential for speedup is very poor in this code. 

The aggregated performance information for the whole program, displayed 
per thread (as in the middle frame of Figure E|), confirms this view. All threads 
are dominated by sequential execution time. Finally, in the bottom frame of 
Figured we see the thread information displayed for a single OpenMP parallel 
region, obtained by clicking on the glyph for a single parallel region. 

Each VGV display contains implicit links to the original source code for the 
program generating the performance data. Any of the timeline regions may be 
“clicked” to obtain a window positioned at the source code that produced the 
data. By using this feature, it is possible to display the parallel loops within 
SWEEP3D that we are judging to be “too small”. From the source code, we 
could determine how to increase the amount of work done in the parallel regions. 

Comparative analysis can also be done by loading trace files from more than 
one program run. VGV will plot the results together in any of its frames, to 
make comparison easy, as in Figure 0 This could allow us to experiment with 
various input data sets, to see how each affected the performance of the program. 
In the Figure, three program runs are being compared, the serial version of the 
program, a 2 (MPI) x 2 (OpenMP) version and a 4 (MPI) x 1 (OpenMP) version. 

8 Scalability of the Tool 

Prior to this project, GuideView already used light-weight summarization tech- 
niques to analyze performance statistics for the OpenMP processors. Vampir- 
trace, on the other hand, wrote trace records to a single trace file for every 
MPI call. This produces a potentially very large trace file that must not only be 
stored, but also completely read and analyzed to provide the user with a display. 
To be scalable, VGV must adequately address the following issues: 

— the disk storage requirements of an event-based tracing tool could become 
enormous for long runs with large numbers of processors, 

— workstation screens have limited space for displaying performance informa- 
tion, 

— simply finding a potential performance problem may be very hard in the 
blizzard of information potentially generated from a massive run. 
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Fig. 4. Performance display frames for the SWEEP3D program, run on two MPI pro- 
cesses with fonr OpenMP threads on each. The top frame shows the whole program 
view, displaying aggregated information from all threads and processes. The middle 
frame shows the threads view of multiple parallel regions. The bottom frame shows 
the threads view of a single parallel region. In each view, the frames show that the 
vast majority of time is being spent in sequential execution (the left-most bar region). 
Parallel execution is the right-most bar region in all cases. 
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Fig. 5. Comparing three program runs with VGV. 



Some of these issues have already been addressed in the current version of 
VGV. Others will be implemented during the remainder of the project. 

VGV will attempt to reduce the size of the trace file through several means: 

event compression - Specialized trace records can be used for some events 
and encoded to save space. Gollective communication events, which usually 
require a full trace record for every process can be reduced to a single trace 
record and a series of small records, one per task. Also, source code line 
numbers can be encoded to save space in each trace record, 
event combination - Events occurring commonly together can be replaced 
by a single event. Very short events which are issued until the MPI state 
changes (e.g. MPIJProbe / MPLTest) can be replaced by a single event 
covering the entire interaction. 

event summarization - Some events can be summarized by maintaining only 
min/max/average values and discarding the events, 
structured trace file - The single trace file can be replaced by multiple, hier- 
archically structured files. This also saves processing time because a top-level 
summary file can processed much more quickly than can the whole original 
trace file. This allows the user to see a summary then drill down to other 
levels in the hierarchy for display. 
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tracing/instrumentation control - Tracing can be disabled or enabled ac- 
cording to a variety of criteria. The user could place enable/disable trace 
calls in the code, or could select specific events to enable/disable, or could 
trace only certain MPI processes, or a variety of other criteria. 

The on-screen presentation of the performance information can be made scal- 
able through vertical scrolling of MPI process time-line information, as well as 
back-to-front stacking of time-lines. 

VGV will use data reduction to attempt to identify potential performance 
problems for the user. Statistical analysis of the data for a single interval of the 
user’s program can identify processes or processors that require unusual (high or 
low) amounts of various resources (e.g. cache misses, time, memory access time) 
and mark them for the user. 

We have found that the execution time of the analysis tool can be a major 
fraction of the time required for the tool. For this reason, the tool will be parti- 
tioned into a display component (DC) and a trace processing component (TPC). 
The TPC can be parallelized and run on a small number of processors. The DC 
can be potentially multi-threaded. 

9 Future Directions for VGV 

VGV is still in the design and development stage, since we are only in the second 
year of the three-year project. We expect significant improvements in the system 
over the remainder of the project. 

Possible future directions for VGV include: 

— Two way interaction between Vampir and GuideView. When the user selects 
something in the GuideView display, VGV should map that back to the 
Vampir Timeline. For example, when selecting a parallel region, a menu 
item would zoom in on the first instance of the parallel region in the time 
line so that you could see the actual event distribution inside the parallel 
region. 

— Integrated resource management. Presently, Vampir and GuideView are run 
in separate processes. This gives them no ability to co-manage the resources 
they use. If they were combined in the same process, they could coordinate 
their use of memory, use of the screen and use of disk space. They could also 
react as a single unit to signals and user requests. 

— Move away from Java to some other, more portable, window manager. Java, 
contrary to popular belief, has proven to be inconsistently portable when it 
comes to managing the screen display. This is worse on older platforms and 
can be attributed to some of the early Java implementations. A more mature 
system, such as Motif, may be more portable. 

10 Conclusion 

Vampir/GuideView is intended to be a flexible, easy-to-use tool for finding 
performance problems in programs written with a combination of MPI and 




52 



J. Hoeflinger et al. 



OpenMP, that run for extremely long times and use thousands of processors. 
We know of no other commercial tool targeted at MPI/OpenMP, and certainly 
none with the ability to handle the massive runs common on the ASCI machines. 
During the remaining two years of its development for the ASCI project, we be- 
lieve that it will become a tool that can be used for the largest ASCI clusters, 
and will help users quickly pin-point performance anomalies in their codes. 
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Abstract. In this paper we present OMPtrace, a dynamic tracing mech- 
anism that combines traditional tracing with dynamic instrumentation 
and access to hardware performance counters to create a powerful tool 
for performance analysis and optimization of OpenMP applications. Per- 
formance data collected with OMPtrace is used as input to the Paraver 
visualization tool for detailed analysis of the parallel behavior of the ap- 
plication. We demonstrate the usefulness of OMPtrace and the power 
of Paraver for tuning OpenMP applications with a case study running 
the US DOE ASCI SweepSD benchmark on the IBM SP system at the 
Lawrence Livermore National Laboratory. 



1 Introduction 

OpenMP has emerged as the standard for shared memory parallel programming, 
allowing users to write applications that are portable across most shared mem- 
ory multiprocessors. However, in order to achieve high performance on these 
systems, application developers still face a large number of application perfor- 
mance problems, such as load imbalance and false sharing. These performance 
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problems make application tuning complex and often counter-intuitive. More- 
over, these problems are hard to detect without the help of performance tools 
that have low intrusion cost and are able to correlate dynamic performance data 
from both software and hardware measurements. 

In this paper, we describe OMPtrace - a dynamic tracing mechanism that 
combines traditional tracing with dynamic instrumentation and access to hard- 
ware performance counters - to create a powerful tool for performance analysis 
and optimization of OpenMP applications. 

OMPtrace is built on top of the Dynamic Probe Class Library (DPCL)|2|, 
an object-based C-|— I- class library and runtime infrastructure that flexibly sup- 
ports the generation of arbitrary instrumentation, without requiring access to the 
source code. DPCL allows the instrumentation of the OpenMP runtime systems, 
providing the flexibility to measure the overhead of initialization and finalization 
of parallel regions. For detailed analysis of the parallel behavior of the applica- 
tion, OMPtrace data is, then, analyzed with the Paraver visualization tool|S|. To 
demonstrate the power of OMPtrace and Paraver, we analyze the performance 
of the SweepSD application^ as a case study. 

The remainder of this paper is organized as follows. In Sectional we briefly 
describe the main DPCL issues. In Section|3we discuss the OMPtrace interface. 
In Section 0 we present a case study where we describe the steps followed to 
analyze the SweepSD application and how OMPtrace and Paraver were useful 
to identify potential improvements. Finally, our conclusions are summarized in 
Section 0 

2 The Dynamic Probe Class Library 

Traditionally, instrumentation systems have had to strike a balance between min- 
imizing instrumentation overhead and maximizing the amount of performance 
data captured. One approach to managing instrumentation overhead is to limit 
both the number of events recorded and the size of those events. However, this 
could mean that key events may not have been recorded. Likewise, if too much 
instrumentation is inserted, the overhead may be so high that it is no longer 
representative of the un-instrumented program’s execution behavior. 

Another challenge is that many instrumentation systems require that pro- 
grams be re-compiled after being instrumented. While this is generally possible, 
for large applications it can be time consuming. Even worse, for third party 
libraries and applications users where the source code may not be available, 
re-compiling will not be possible. An alternative is to allow a program to be 
modified while it is executing, and thereby eliminate the need to re-compile, 
re-link, or re-execute the program. 

Dynamic instrumentation provides the flexibility for tools to insert probes 
into applications only where it is needed. The Dynamic Probe Class Library, 
developed at IBM, is an extension of the dynamic instrumentation approach, 
pioneered by the Paradyn group at the University of Wisconsin|S| . DPCL is built 
on top of the Dyninst Application Program Interface (API)0. Using DPCL, 
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a performance tool can attach to an application, insert code patches into the 
binary and start or continue its execution. Access to the source code of the 
target application is not required and the program being modified does not need 
to be re-compiled, re-linked, or even re-started. 

DPCL provides a set of C-I--I- classes that allows tools to connect, examine, 
and instrument a spectrum of applications: single processes to large parallel ap- 
plications. DPCL is composed of a client library, a runtime library, a daemon, 
and a super-daemon. End user tools can be created with the client library. The 
runtime library supports instrumentation generation and communication. The 
daemon interfaces with the Dyninst library to instrument and manage user pro- 
cesses; a super-daemon manages security and client connections to these DPCL 
daemons. With DPCL, program instrumentation can be done at function entry 
points, exit points, and call sites. 



3 The OMPtrace 

The integration of DPCL into OMPtrace is based on the fact that the IBM 
compiler translates OpenMP directives into function calls. Figure d shows, as 
an example, the compiler transformations for an OpenMP parallel loop. The 
OpenMP directive is translated into a call to a function from the OpenMP 
runtime library (“xlsmpParDoSetup”), which is responsible for thread manage- 
ment, work distribution, and thread synchronization. The loop is transformed 
into a function (“AOOLl” in the example in Figure^ that is called by each of the 
OpenMP threads. 

Since DPCL allows the installation of probes at function call entry and exit 
points, as well as before and after a function call, the OMPtrace tool installs two 
pairs of probes for each parallel region in the target application. As shown in 
Figure O the first pair (DPCL probe (1)) is inserted before and after the call to 
the OpenMP runtime library function, while the second pair (DPCL probe(2)) 
is inserted at the call entry and exit point of the parallel region. Given these 
two pairs of probes, one can measure the overhead of starting and terminating a 
parallel region. Additionally, a third pair of probes (DPCL probe (3)) is inserted 
at the call entry and exit points of each function that contains an OpenMP 
parallel region. 

Figure 0 displays the startup procedure executed by OMPtrace. The tool 
communicates with the DPCL daemon (1), which in turn acts on the application 
binary (2). Probe installation (3) is executed in two steps. First, OMPtrace 
requests the DPCL communication daemon to load the tracing module into 
the target application. This module contains the functions that will be called 
by the probes. Once the tracing module has been loaded, OMPtrace requests 
the communication daemon to inserts the probes into the application. After 
the probes are installed, OMPtrace starts the application. Notice that nothing 
precludes OMPtrace from attaching to a running application and execute the 
same procedure. We are considering this feature for future work. 
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Fig. 1. Compiler transformations for an OpenMP parallel loop. 
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Fig. 2. DPCL probes on functions that contain parallel regions. 
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Fig. 3. Application startup with OMPtrace. 



Similarly, OMPtrace can be used to instrument locks that are used to en- 
sure mutual exclusion in the application. In this case, dynamic instrumentation 
is placed before and after the OpenMP functions that are called to handle the 
locks, generating an event trace every time that a thread enters in one of the fol- 
lowing four states: trying to acquire a lock, lock acquisition, starting to release 
a lock, and lock release. This instrumentation is useful to measure lock over- 
head, contention in critical sections, and the actual pattern of lock acquisition. 
Since this instrumentation may introduce a significant overhead, especially for 
very small critical sections, it is only activated when specified by the user with 
command line flags when executing OMPtrace. It is our experience that even 
in cases where the overhead is significant, the information on the pattern and 
interaction between threads that this tracing facility provides is very helpful to 
improve the performance of the parallel program. 

Another OMPtrace feature is the ability to automatically access hardware 
counters. The IBM PowerS processor provides 8 counters, each one able to count 
a number of hardware events. OMPtrace allows users to select any valid combi- 
nation of hardware events, via an environment variable. By default, OMPtrace 
uses a standard set of events to count instructions, floating point operations, 
fused multiply adds (FMAs), and loads. When the hardware counters option is 
activated, OMPtrace emits event records at the entry and exit of every instru- 
mented point in the program, identifying the hardware events being collected and 
for each event, the count between the current and the previous tracing point. 
In order to avoid excessive overhead and reduce trace file size for the default 
analysis, this feature is also only activated via a command line flag. 

One of the known weakness of hardware performance counters is that they 
only provide raw counts, which does not necessarily help users to identify which 
events are responsible for bottlenecks in the program performance. However, 
Paraver has a very flexible mechanism to compute and display a large number 
of performance indices and derived metrics from the information emitted into 
the trace by OMPtrace. Thus, the hardware counter information included in the 
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trace file can be later processed by Paraver to generate a large number of perfor- 
mance indices, which allows users to correlate the behavior of the application to 
one or more of the components of the hardware. For example, Paraver can dis- 
play as a function of time for a given routine (or interval) the quotient between 
the number of LI misses (as reported by the event at the exit of the routine) 
and the duration of the routine. Indices such as LI misses per second or floating 
point operations per second can be visualized as a function of time. Addition- 
ally, a second level of semantic functions can be obtained by combining (i.e., 
adding, dividing, etc.) the functions of time computed directly from the records 
in the trace as stated above. We call this feature derived windows. For example, 
starting with a window that looks at “cycles” to compute the number of cycles 
for each function (or interval) and other window that looks at the “instructions” 
it is possible to derive an IPC window by dividing those two windows. This 
derived window will display the actual Instructions per Cycle obtained for each 
interval of the application, which can be useful for example to compare with the 
theoretical limit of the machine (4 issue in the PowerS case). 

An interesting way to use these derived windows is to build performance 
models of the processor and try to explain the performance of the application 
based on these models. For example, one can compute the theoretical IPC limit 
considering just the number of floating point operations and the number of 
misses by taking into account that only two FPUs are available and assuming a 
certain miss cost. Comparing this model with the observed IPC gives an insight 
on whether the performance is limited by the number of FPUs or by the cost of 
the cache miss. 

In addition to installing these dynamic probes, OMPtrace accepts static in- 
strumentation placed by the user, for tracing of other functions or code regions 
in the program. During program execution, OMPtrace generates trace records. 
These records contain absolute times from the activation of the instrumented 
points in the program during the parallel execution, as well as, the information 
gathered for these points (for example, data from hardware performance coun- 
ters) . Each record represents an event or activity associated to one thread in the 
system. At the end of execution, these traces are combined into a single Paraver 
trace file, in order to convert these “punctual’ events into “interval values” . 



4 Case Study 

In this section we describe the steps followed to analyze an application and 
how OMPtrace and Paraver were useful to identify potential improvements. We 
observe that performance tuning of any large application is in general a never- 
ending task, with new potential improvements arising just after a previous one 
has been implemented. Thus, our intent was not to optimize the performance 
of the application to the utmost possible level. Instead, we focused on the way 
Paraver was helpful in the process. 

In this case study, we used the US DOE ASCI SweepSD benchmark, which 
uses a multidimensional wavefront algorithm for “discrete ordinates” determin- 
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istic particle transport simulation. Figure 0 displays the SweepSD major data 
structures and the iteration space. The core computation presents reductions in 
all directions (i, j, k, and m); thus posing some problems to parallelization. To 
solve these problems, SweepSD benefits from multiple wavefronts in multiple di- 
mensions, which are partitioned and pipelined on a distributed memory system. 
The three dimensional space is decomposed onto a two-dimensional orthogo- 
nal mesh, where each processor is assigned one columnar domain, as shown on 
Figure 0(a). SweepSD pipelines the K dimension, exchanging messages between 
processors as wavefronts propagate diagonally across this SD space in eight di- 
rections. 




Fig. 4. SweepSD major data structures and iteration space. 



Within each MPI process domain further decomposition of work among sev- 
eral threads can be achieved with OpenMP. The approach is to parallelize the 
execution of the planes of a diagonal wavefront that traverses the sub-cube com- 
puted by each MPI process. Each such plane is inherently parallel as each of its 
points contributes to a different reduction in each of the i, j, and k directions, 
as shown in Figure 0(b). This is nevertheless at the expense of additional index 
computations and triangular loop trip count, which causes significant overhead 
both in terms of index computations and of OpenMP run time library overhead. 

Figure El displays the computational flow of Sweep 3D in its original version, 
which we will refer here as version. However, as an alternative approach 

described in the source distribution, the “do idiag” and “do jkm” loops, shown 
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Fig. 5. (a) MPI parallelization structure and (b) sub-cube diagonal parallelization 
structure with OpenMP. 



in Figure 0 can be replaced by a tripled nested loop (“do m”, “do k”, and 
“do j”), which we will refer here as “mfcj” version. 

Our trace collection and analysis was performed with a small problem size, 
using a cube of dimensions 50x50x50, running on one SP Nighthawk II node 
with 16 375 MHz Power3-|- processors. The performance observations were then 
validated running a mixed MPI/OpenMP code with a larger problem size, using 
a cube of dimensions 300x300x100, on 12 SP Nighthawk I Nodes, each node 
with 8 222 MHz Power3 processors. 

As described above, the original MPI version is parallelized in two levels, 
along the “I” and “J” dimensions. Table Q] presents the elapsed times in sec- 
onds for the MPI versions corresponding to different partitioning of the global 
iteration space, and the elapsed time for the OpenMP “diag” version. The first 
number in the decomposition indicates the number of processors used for the 
partitioning across the I dimension, while the second number indicates the num- 
ber of processors for the J dimension. 

We observed that on 6 processors, the best MPI decomposition ran in 3.69 
seconds, while the OpenMP version ran in 7.78 seconds. The OpenMP per- 
formance with 12 processors was almost three times worst than the best MPI 
performance. In order to identify the reasons for this performance difference, we 
obtained two sets of trace of the MPI and the OpenMP versions, one using the 
default set of hardware counters to measure communication and synchronization 
overheads, and the other counting cache misses (level 1 and 2) and TLB misses 
to investigate locality issues. 
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DO iq=l,8 
DO mo=l ,mmo 
DO kk=l,kb 
RECV E/W 
RECV N/S 

DO idiag=l , jt+nk-l+mmi-1 
DO jkm=l ,ndiag 

j,k,m = f (idiag, jkm) 
DO 1=1,1^ 

ENDDO 

DO i=i0,il)i2 

ENDDO 

DO 1=1,1^ 

ENDDO 
DO 1=1,1^ 

ENDDO 
ENDDO 
ENDDO 
SEND E/W 
SEND N/S 
ENDDO 
ENDDO 
ENDDO 



! octants 

! angle pipelining loop 
! k-plane pipelining loop 
! recv block I-inflows 
! recv block J-inflows 
! JK-diagonals with MMI pipelining 
! I-lines on this diagonal 
! map to j , k, and m indices 
! source (from Pn moments) 

! Sn eqn 

! flux (Pn moments) 

! DSA face currents 



! send block I-outflows 
! send block J-outflows 



Fig. 6. Sweep 3D control flow 



Table 1. Elapsed time in seconds for “diag”, running the small problem with OpenMP 
and with MPI. 



NB Domains 


OpenMP time 


Decomposition 


MPI time 


6 


7.78 


1x6 


3.97 


2x3 


3.69 


3x2 


3.71 


6x1 


4.47 


12 


6.55 


1x12 


3.50 


2x6 


2.74 


3x4 


2.21 


4x3 


2.25 


6x2 


2.97 


12x1 


3.98 



Using Paraver to compute the total useful computation, we observed that 
both versions were loosing a similar percentage of time in synchronization and 
communication. The percentage of time inside numerical computation routines 
was around 65% for both runs. In the case of OpenMP this low percentage 
was partially due to the fine granularity of the triangular loops, and because it 
still executed some sequential computation, since only the major computational 
loop was parallelized. On the other hand, we observed that the OpenMP version 
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had less LI misses (2137 per ms) and TLB misses than the MPI version (4153 
LI misses per ms), but much more L2 misses (1356 per ms for the OpenMP 
version versus only 133 per ms for the MPI version). The rate of L2 misses 
per millisecond for one traversal of the 3D iteration space in “diag” is shown 
in Figure [ 7 | and Figure 0 for the OpenMP and MPI versions respectively. In 
these figures, darker gray (blue) represents large values, while white (green) 
corresponds to low values. The areas with low values in Figure Q correspond to 
intervals where the threads are waiting for work or synchronizing. Hence, our 
optimization efforts concentrated on improving locality of the OpenMP version 
and minimizing coherence invalidations. 




Fig. 7 . L2 misses per milliseconds for the OpenMP execntion of “diag” . 




Fig. 8. L2 misses per milliseconds for the MPI execution of “diag”. 



Taking into account that shared memory is inherently more efficient than 
message passing for fine grain synchronization, we implemented an OpenMP 
parallel version based on the “mkj” version, where the outer loop was parallelized 
and the internal precedence was enforced by some synchronization mechanism. 
Two approaches were implemented. The first was the version “ccrff’, which uses 
the CRITICAL OpenMP directive for the implementation of the reduction. The 
result was a fairly high contention on the lock, a behavior that could be visualized 
with Paraver, as shown in Figure 0 which displays the behavior of the critical 
section access. The long regions in gray (red) correspond to the time threads are 
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trying to get a lock that is already taken. The dark (blue) periods correspond 
to threads using the lock. White (green) is when a thread releases the lock and 
light gray (light blue) corresponds to execution outside of the critical section. As 
can be observed, the sequence of accesses does not follow a specific pattern, and 
the waiting time to obtain a lock has a large variance. Using the quantitative 
analysis module of Paraver we measured the average waiting time to be 170 
microseconds with a standard deviation of 172. 




Fig. 9. Lock access pattern in the “ccrit” version. 



The second synchronization mechanism, which we refer to “cpzpe”, was im- 
plemented based on shared arrays and busy waiting. In this version the itera- 
tions of the parallelized loop (“do m”) are interleaved across threads. When a 
thread finalizes its part of computation involved in the reduction it signals to 
the following thread to continue with its part. In this approach, the reductions 
are executed in sequential order (although different threads compute different 
parts). Thus, in this case, since the sequential order of the computation of the 
reductions are preserved, the numerical results are identical to the sequential 
execution, independently of the number of processors used in the parallel com- 
putation. This is an important advantage compared to the “ccrit” version, where 
the critical sections used to assure atomicity of the reduction updates do not pre- 
served their order; thus resulting in numerical differences between runs. In this 
version we observed that the synchronization overhead was very low and a fairly 
good pipelining was achieved. 

A comparison of the single processor run of the “diag” and “mkj” versions 
showed that the “mkj” version was significantly slower. Using Paraver traces with 
the sequential application we observed that the first, third and fourth “do i” 
loops were touching the variables “flux”, “src”, and “face” and incurred most 
of the level 2 cache misses. However, analyzing the source code, it could be 
observed that interchanging the “do m” loop inwards would reduce misses in 
the “do i” loops. Besides the locality problems, parallelizing the m dimension 
also has the problem of the small trip count of the loop (only 6), which limits 
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parallelism. Taking into account data locality and trip count considerations de- 
scribed above, we interchanged the loops, creating the version “kjmi\ In order to 
achieve good pipelining overlap in this version, the “do k” loop was parallelized 
with SCHEDULE (STATIC , 1), which means totally interleaved. This causes matrix 
“Phikb” to be circulated between processors for each k, generating level 2 cache 
invalidations and a slightly higher miss ratio than the MPI version. Thus, in or- 
der to increase the reuse of “Phikb” we introduced a final modification (version 
“ifjfcmz”) where the k loop was strip-mined and interchanged according to the 
version name. 



Table 2. Elapsed time in seconds for the different OpenMP versions. 



version 


1 


2 


3 


4 


5 


6 


8 


9 


10 


11 


12 


13 


14 


15 


16 


ccrit 


28.26 


24.41 


26.84 


26.47 


29.28 


30.34 










30.43 










cpipe 


25.63 


18.45 


13.01 


12.53 


10.06 


7.67 










7.76 










diag 


17.28 


13.09 


11.40 


9.64 


8.50 


7.78 










6.55 










kjmi 


14.86 


10.01 


7.35 


5.82 


4.89 


4.34 


3.62 


3.38 


3.09 


3.04 


2.88 


2.69 


2.69 


2.64 


2.53 


Kjkmi 


14.91 


8.47 


6.35 


4.91 


4.24 


3.58 


2.90 


2.81 


2.78 


2.65 


2.29 


2.22 


2.16 


2.19 


2.15 



A performance summary of the different OpenMP versions of the program, 
running the small problem size is presented in Table El The numbers show the 
inefficiency and lack of scalability of the “ccrit” version. The overhead of the 
mutex lock and unlock needed to protect the critical section can be observed, 
when comparing the times for just one thread in the “ccrit” and “cpipe” ver- 
sions. The huge contention at the lock shown in Figure El causes the scalability 
problems. 

Although “cpipe” performed better than “ccrit”, when comparing to the 
other three versions we observe that “cpipe” also had poor locality behavior, 
and scalability problems. These problems occur mainly because the parallel loop 
on “m” has only an iteration count of 6, which results in a poor pipelining. 

The scalability of “diag” is limited, as mentioned above, due to the very high 
number of L2 misses caused by false sharing and the variable trip count of the 
parallelized loops. The overhead of opening and closing such parallel loops with 
very small trip counts for the diagonal planes at the corners of the cube also 
contributes to the poor scalability of this version. 

The two final versions show much better behavior for just one thread, an 
effect that not only benefits the OpenMP code, but also the MPI. Scalability is 
fair, and the performance achieved is equivalent to that of pure MPI as reported 
in Table Q In some cases, as “Kjkmi” running 6 threads, as well as in other 
experiments we have performed with larger problem sizes, we observed that the 
OpenMP versions were marginally better than the pure MPI version. 

Table 0 presents the best elapsed time over two runs using the larger problem 
size (300x300x100) on 12 IBM SP Nighthawk I Nodes at Lawrence Livermore 
National Laboratory. Notice that this table shows only a few combinations of 
MPI tasks and OpenMP threads that were selected from the large space of 
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Table 3. Elapsed time in seconds, for the large problem size, running on 12 IBM SP 
Nighthawk I Nodes. 



MPI 

Tasks 


Version 


Threads 


0 


2 


3 


4 


5 


6 


7 


8 


12 


kjmi 

Kjkmi 










56.61 

56.62 


55.33 

57.00 


52.96 

56.70 


40.55 

57.14 


24 


kjmi 

Kjkmi 






49.87 

65.69 


40.39 

45.61 










48 


kjmi 

Kjkmi 




39.41 

45.69 














84 

96 


diag 

diag 


41.68 

40.93 

















possible configurations. Also, for each MPI task, only one decomposition was 
considered. Therefore, based on the results from the small problem size, where we 
observed that the MPI performance is heavily dependent on the decomposition, 
one should be aware that the MPI decomposition chosen for these experiments, 
was based on the observations from the small problem size, but the times might 
not necessarily represent the best MPI performance for this problem size. 

We observe that when running the large problem size with all 96 processors, 
we were able to confirm the analysis derived from the small problem size for the 
mixed MPI/OpenMP version of “kjmi”. This version performed slightly better 
than the pure MPI version with all three combinations of tasks and threads 
(namely, 48/2, 24/4, and 12/8). On the other hand, the performance of the 
“Kjkmi” version did not perform as well as expected, and in the only situation 
where its performance was comparable to the “kjmi” version, it did dot scale 
well with more than 5 threads. Thus, more analysis is necessary to understand 
its performance behavior. 

When running the mixed mode approach, we observed some conflicts in the 
scheduling of the two parallelization strategies (MPI and OpenMP). This prob- 
lem can be observed in Figure Eni which shows for each thread, the pattern of 
computation from the iterations of the parallel loop, when running “kjmi” with 
the small problem size, using 2 MPI tasks and 8 OpenMP threads per task. In 
this figure, where each dark (blue) area between two flags corresponds to one 
iteration of the loop, we can observe an unbalance between threads inside of 
each MPI task. The reason for this unbalance is due to the MPI pipelining that 
was set to have “k” dimension of 10 planes. Hence, the parallel loop had only 
10 iterations, and when scheduling such number of iterations among 8 threads, 
two of then will perform two iterations, while the other six threads will perform 
only one iteration and then wait for the first two to finish. 

Therefore, another important observation of this experiment was that when 
mixing different programming models, it is of key importance to analyze the 
scheduling decisions taken by the different parallelization strategies. It is fairly 
easy for these strategies to interfere with each other, and without an analysis 
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Fig. 10. Scheduling of the OpenMP loop iterations 



tool such as OMPtrace and Paraver it may be difficult to understand the effects 
in performance. 



5 Conclusions 

In this paper we described OMPtrace, a dynamic tracing mechanism that com- 
bines traditional tracing with dynamic instrumentation and access to hardware 
performance counters to create a powerful tool for performance analysis and op- 
timization of OpenMP applications. Performance data collected with OMPtrace 
is used as input to the Paraver visualization tool for detailed analysis of the 
parallel behavior of applications. 

The usefulness of OMPtrace and the power of Paraver for tuning and opti- 
mizing OpenMP applications was illustrated in a case study with the US DOE 
ASCI SweepSD benchmark. We analyzed the performance of a small problem 
size of SweepSD, running on a single IBM SP node with 16 processors, and val- 
idated the performance observations and code optimizations, running a mixed 
MPI/OpenMP version of the code, with a larger problem size, on 12 IBM SP 
Nodes with 8 processors each. 

The performance of the original OpenMP version was three times slower 
than the MPI version when running on 12 processors of a single IBM SP node, 
but when running the optimized version with the larger problem size on 96 
processors, the mixed MPI/OpenMP version performed slightly better than the 
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pure MPI version for all three combinations of MPI tasks and OpenMP threads 
used (48 task and 2 threads, 24 tasks and 4 threads, and 12 tasks and 8 threads). 

We notice that the two dimensional MPI parallelization and the correspond- 
ing combinations of possible decompositions had an important effect on the 
performance of the MPI program. In OpenMP there is only one dimensional 
parallelization and it would be interesting to experiment with nested OpenMP 
parallelism. 

Finally, we observe that there are still several issues, such as memory man- 
agement and scheduling conflicts, that remain as challenges for optimization of 
the mixed MPI/ OpenMP application. 



References 

1. B. R. Buck and J. K. Hollingsworth. An API for Runtime Code Patching. In Journal 
of High Performance Computing Applications, 14(4):317-329, Winter 2000. 

2. L. DeRose and T. H. Hoover Jr. and J. K. Hollingsworth The Dynamic Probe Class 
Library - An Infrastructure for Developing Instrumentation for Performance Tools. 
In Proceedings of 2001 International Parallel and Distributed Processing Symposium, 
April 2001. 

3. European Center for Parallelism of Barcelona (CEPBA). Paraver - Parallel Program 
Visualization and Analysis Tool - Reference Manual, November 2000. 

http:/ /www. cepba.upc.es/paraver. 

4. K. R. Koch, R. S. Baker, R. E. Alcouffe. Solution of the First-Order Form of the 3-D 
Discrete Ordinates Equation on a Massively Parallel Processor. In Trans. Amer. 
Nuc. Soc. 65(198), 1992. 

5. B. P. Miller, M. D. Callaghan, J. M. Cargihe, J. K. Hollingsworth, R. B. Irvin, K. L. 
Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn Parallel Performance 
Measurement Tools. In IEEE Computer, 28(ll):37-46, November 1995. 




A Comparison of Scalable Labeling Schemes for 
Detecting Races in OpenMP Programs* * 



So-Hee Park, Mi- Young Park, and Yong-Kee Jun** 

Dept, of Computer Science, Gyeongsang National University 
Chinju, 660-701 South Korea 
{shpark.park, jun}@race . gsnu. ac .kr 



Abstract. Detecting races is important for debugging shared-memory 
parallel programs, because the races result in unintended nondeterminis- 
tic executions of the program. On-the-fly technique to detect races uses 
a scalable labeling scheme which generates concurrency information of 
parallel threads without any globally-shared data structure. Two effi- 
cient schemes of scalable labeling, BD Labeling and NR Labeling, show 
the similar complexities in space and time, but their actual efficiencies 
have been compared empirically in no literature to the best of our knowl- 
edge. In this paper, we empirically compare these two labeling schemes 
by monitoring a set of OpenMP kernel programs with nested parallelism. 
The empirical results show that NR Labeling is more efficient than BD 
Labeling by at least 1.5 times in generating the thread labels, and by at 
least 3.5 times in using the labels to detect races in the kernel programs. 



1 Introduction 



A race is a pair of instructions which accesses a shared variable with at least one 
write access without coordination in an execution of the program. Detecting the 
races is important for debugging such a kind of parallel programs as OpenMP 
programs [2CI, because the races result in unintended non-deterministic execu- 
tions of the program. On-the-fly race detection [1 13l4lfi| instruments a debugged 
program and monitors an execution instance of the program to report races 
which occur during the monitored execution. To detect any race which involves 
the current access in an execution, this technique uses a race detection protocol 
which determines the logical concurrency between the current access and the 
previous conflicting accesses, and then maintains an access history of the shared 
variable for the subsequent race detection. For the concurrency information to be 
used in the protocol, an on-line labeling scheme generates the logical concurrency 
information, called a label, for every thread in the execution. 

Various on-line labeling schemes mm have been reported in the litera- 
tures, but some labeling schemes jH| use a centralized data structure which is 
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globally-shared among the threads and then may incur serious bottleneck in gen- 
erating the thread labels. Among the scalable schemes, OS Labeling is less 
efficient than the other scalable schemes m, because it increases the storage 
space of each label in the access histories in proportion to the nesting depth of the 
monitored parallelism and then requires an additional time overhead of garbage 
collection to save the space for the access histories. The other two scalable label- 
ing schemes, BD Labeling P and NR Labeling P, generate a constant-sized label 
for each entry of access histories, and show the similar complexities each other 
in space and time, but their actual efficiencies have been compared empirically 
in no literature to the best of our knowledge. 

This paper compares the actual efficiencies of these two scalable schemes: 
BD Labeling and NR Labeling. BD Labeling generates a label on every event 
to fork or join or coordinate threads, because its basic idea was originated from 
monitoring distributed-memory programs with nested parallelism. To monitor 
an execution of OpenMP program, it is sufficient for each logical thread to have 
just one label, such as in NR Labeling. Thus we modify BD Labeling to be more 
efficient by making it generate only one label for each thread, and call it Thread- 
based-BD (T-BD) Labeling. We empirically compare T-BD Labeling with NR 
Labeling using a set of OpenMP kernel programs with nested parallelism. The 
kernel benchmark programs are monitored sequentially, because the comparison 
requires to measure the net execution time of labeling scheme in a monitored 
execution of program. The empirical results show that NR Labeling is still more 
efficient than T-BD Labeling in every aspect of time efficiencies in on-the-fly 
race detection. 

After describing first the background information on this work in section 2, 
we introduce the two scalable labeling schemes, T-BD Labeling and NR Labeling, 
with the corresponding examples in the following two sections. Then we explain 
our experimentation which includes the race detection protocol we used, the 
kernel programs, and the measurement strategy before analyzing the results. 
Our work is concluded in the final section. 



2 Background 

This section describes the OpenMP nested parallelism considered in this work 
using a directed acyclic graph called Partial Order Execution Graph (POEG) 
0. POEG captures the happened-before 0 relationship and imposes a partial 
order on the set of threads that make up an execution instance. Then we discuss 
the labeling schemes to generate information of such concurrency relationship 
to detect races in an execution of parallel programs. 

2.1 OpenMP Program 

In this paper, we consider OpenMP programs with nested parallel loops without 
coordination operation between threads. If there is no other loop contained in a 
loop body, the loop is called an innermost loop. Otherwise, it is called an outer 
loop. In a nested do-loop D, an individual loop can be enclosed by many outer 
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C$0MP 


PARALLEL DO 






DO 


I = 1, 2 






IF 


(I .EQ. 2) 
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END 


PARALLEL 
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END 


DO 




C$0MP 


END 


PARALLEL 


DO 




Fig. 1. An OpenMP Parallel Program and its POEG 



loops. The nesting level of an individual loop is equal to one plus the number of 
the enclosing outer loops. The nesting depth of D is the maximum nesting level 
of loops in Z). In a perfectly or one-way nested loop of nesting depth TV, there 
is exactly one loop at each nesting level i, (i = 1, 2, • • • , TV). A loop is multi-way 
or m-way nested if there exist m disjoint loops in a nesting level, {m >1). For 
example. Figure E shows a two-way nested loop of nesting depth three. If we 
remove the loop indexed by J1 in the figure, the program becomes a one-way 
nested loop of nesting depth three. Let li denote the loop index of a loop 13^, 
Li and Ui denote the lower and upper bound of li respectively, and Ai denote 
the increment of A. A loop Di is normalized if the values of both Li and Ai are 
one. Ai is optional when it is equal to one. To make our presentation simple, we 
assume that all parallel loops are normalized loops. Figure^shows a normalized 
loop with index J1 for which the lower bound is one, the upper bound is I+l, 
and the increment is one. 

In an execution of OpenMP program, more than one thread can be forked 
to share a work at a ‘C$0MP PARALLEL DO’ directive, and can be joined at the 
corresponding ‘C$0MP END PARALLEL DO’ directive. Such fork and join opera- 
tions are called thread operations. Figure ^ shows an OpenMP Fortran program 
where four parallel loops are specified with the corresponding pairs of directives. 
The concurrency relationship among threads is represented by a directed acyclic 
graph, called Partial Order Execution Graph (POEG) 0. A vertex of POEG 
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represents a thread operation, and an arc started from a vertex represents a 
logical thread started from the corresponding thread operation. Since the graph 
captures the happened-before relationship |n|, it presents a partial order on a set 
of the logical threads that make up an execution instance of a parallel program. 
For example, the POEG in Figure ^ shows a partial order on the threads in an 
execution instance of the two-way nested loop. 

Ordering threads does not depend on the number and the relative execution 
speeds of processors that executes the program. A thread G happened before a 
thread tf,, if there exists a path from ta to in the graph. In this case, is 
an ancestor thread of tb, and tb is a descendant thread of ta- A thread ta {h) is 
a parent (child) thread of tb (ta), if there exists no thread G such that ta (tb) 
happened before G which happened before tb (ta)- There exists two concurrent 
threads if and only if there exists no path between the two threads in the POEG. 
For example, consider the execution instance of POEG shown in Figure^ In this 
figure, the thread numbered three, denoted by t^, happened before the thread 
tg, because there exist a path from ts to tg. The thread tg is concurrent with ts, 
because there exist no path between the two threads. 



2.2 Online Labeling Schemes 



On-the-fly race detection instruments a debugged program and monitors an exe- 
cution instance of the program to report races which occur during the monitored 
execution. To detect any race which involves the current access in an execution, 
this technique uses a race detection protocol which determines the logical con- 
currency between the current access and the previous conflicting accesses, and 
then maintains an access history of the shared variable for the subsequent race 
detection. For the concurrency information to be used in the protocol, an on-line 
labeling scheme generates the logical concurrency information, called a label, for 
every thread in the execution. 

Various on-line labeling schemes [lKII4lfi| have been reported in the litera- 
tures, but some labeling schemes such as Task Recycling use a centralized 
data structure which is globally-shared among the threads and then may incur 
serious bottleneck in generating the thread labels. Among the scalable schemes, 
OS Labeling is less efficient than the other scalable schemes am, because it 
increases the storage space of each label in the access histories in proportion to 
the nesting depth of the monitored parallelism and then requires an additional 
time overhead of garbage collection to save the space for the access histories. 
The other two scalable labeling schemes, BD Labeling P and NR Labeling Pj, 
generate a constant-sized label for each entry of access histories, and show the 
similar complexities each other in space and time, but their actual efficiencies 
have been compared empirically in no literature to the best of our knowledge. 

This paper compares the actual efficiencies of these two scalable schemes: 
BD Labeling and NR Labeling. BD Labeling generates a label on every event to 
fork or coordinate threads in addition to every logical thread, because its basic 
idea was originated from monitoring distributed-memory programs with nested 
parallelism. To monitor an execution of shared-memory OpenMP program, it is 
sufficient for each logical thread in POEG to have just one label, such as in NR 
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Fig. 2. A T-BD Labeling 



Labeling. For example, we generate only one label for a logical thread which is 
started from a join operation and terminated at a fork operation. BD Labeling 
however generates two labels for this thread: one is generated for the thread itself 
immediately after the join operation is completed and the other is generated 
for the fork operation. Thus we modify BD Labeling to be more efficient by 
making it generate only one label for each thread, and call it Thread-hased-BD 
(T-BD) Labeling. This thread-based labeling scheme can also be applied to the 
coordination operations between threads in a similar way. 



3 T-BD Labeling Scheme 

In T-BD Labeling, each logical thread has concurrency information which con- 
sists of two main components: a BD label and a list structure L, denoted by 
BD & L altogether. A BD label identifies a thread in the nesting level n with 
two subcomponents: BDn = (&„, d„), where h and d are integers called a breadth 
and a depth of the thread, respectively. The list L„ of is a list of triples. 
The t-th entry of L„, denoted L„(i), consists of three integers, {ti-\,Ti,di-\), 
for the most recent thread ti-\ in nesting level i — 1 such that U-i is ordered 
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0 T-BD-Init() 

1 i:= 0 ; 

2 (b,d)-( 0 ,l); 

3 L(l) := (1,0,1); 

4 End T-BD-Init 

0 T-BD-Fork() 

1 i := ip + 1-, 

2 for a := 1 to i do 

3 L{a) ~ Lp{a)\ 

4 endfor 

5 L{i, T) := U; 

6 

7 for k 1 to i — \ do 

8 g--gxL{k,T)- 

9 endfor 

10 b :=bp + g- 

11 d := dp + 1 ; 

12 L(i + 1 ) := (t, 0 ,d); 

13 d' ■- d; 

14 End T-BD-Fork 

Fig. 3. T-BD 



0 T-BD-Join() 

1 d \= d' + Ij 

2 L{i + 1, d) ~ d; 

3 dp max{dp,d}; 

4 End T-BD-Join 

0 T-BD-Ordered(fe, d) 

1 k--l- 

2 while L{k, T) 7 ^ 0 do 

3 tj) := b mod L{k, T) + 1; 

4 if t_b 7 ^ L{k + 1, t) then 

5 return false; 

6 elseif d < L{k + 1, d) then 

7 return true; 

8 endif 

9 b--b/L{k,T)- 

10 fc:=fc + l; 

11 end while 

12 return true; 

13 End T-BD-Ordered 

eling Algorithms 



with the thread where i = 1, 2, • • • , n + 1. Here, ti is the loop index of thread 
which is less than or equal to Ti which is the number of siblings of the thread 
ti- And the thread depth di is the Lamport timestamp |S| which means one plus 
the number of threads that happened before the thread U in the critical path 
from the initial thread in the POEG. The breadth bi of a thread ti is defined 
as bi = bi_i + {ti — 1) 111=1 Li{k,T), where 5o = 0 and Li{k,T) stands for Ti 
in Li{k). From this equation, we have b\ = t\ — 1, which is the remainder of 
dividing bi with Li{l,T). 

For example, consider a thread t^ in Figure El whose BD label is (3,6) in 
the third nesting level. The breadth of this thread 63 is three, because bo = 0, 
h = {bo + 1) = 1, 62 = (&i + 1 X 2) = 3, and 63 = (62 + 0) = 3. ^3(4, T) of this 
thread is zero, because the thread does not know the number of threads to be 
forked next. This entry is only for initializing t^ and do in ^3(4). Whenever a 
thread in the nesting level i is forked, the on-line algorithm computes {bi,di,Ti} 
to generate its BD label, Li{i, T), and Li{i -b 1) but Li{i -b 1, T). In this process, 
a constant number of computations is sufficient except the case of computing bi 
whose number of computations increase as large as the size of Li which increases 
as large as the nesting depth. 

When we check concurrency relationship between any two threads, ti and tj, 
only one of their Li and Lj is sufficient to do the job. Consider the case of using 
Lj. By the definition of the breadth of a thread, a thread tm can be assumed 
as U, where m = 1, 2, • • • , i, from obtaining one plus the remainder in dividing 
bi by Lj{k,T), where fc = 1,2, • • • ,j. We compare tm with tk in Lj{k -b l,t). If 
tm is equal to tk, and di is less than or equal to dk in Lj{k + 1, d), the thread ti 
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happened before tj. For example, consider the two threads in FigureEl U = (5, 3) 
and tj = (3,6). In case we use the list structure L 3 of tj, we can compute tm 
using bi = 5 and Lj{k, T), where k = 1, • • • , 3, resulting in tm = 2, where m = 1. 
Therefore, the thread ti happened before tj, because the computed ti = 2 with 
bi = 5 is equal to the stored ti = 2 in L^{2, t), and di = 3 is less than di = 4 in 
Ls{2,d). 

We implemented T-BD Labeling with four algorithms shown in Figure El 
T-BD-Init(), T-BD-Fork(), T-BD-Join(), and T-BD-Ordered(). Among them, 
T-BD-Ordered( 6 , d) checks the logical concurrency between the current thread 
and another thread whose label is {b,d). The other three algorithms generate 
concurrency information including BD label: T-BD-Init() for an initial thread, 
T-BD-Fork() for a thread started from fork operation, and T-BD-Join() for a 
thread started from join operation. In the algorithms, each data structure with 
a subscript p such as Lp means the corresponding data structure of the ancestor 
thread which forked the current thread. The depth variable d' is a mirror variable 
which is shared locally by the siblings of the current thread to help maintain d 
in the thread. And U denotes the upper bound of the current loop index. 



4 NR Labeling Scheme 

In NR Labeling, each thread has concurrency information which consists of three 
main components: the one-way depth the one-way region OR, and the one-way 
history OH of the thread, denoted by [^, OR\ cc OH altogether. The one-way 
depth ^ is one plus the nesting level of the most recent one-way root of the 
thread. A one-way root of a thread t is the most recently joined ancestor of t in 
each nesting level which is higher than or equal to the nesting level of t. Here 
the set of joined ancestors of a thread are assumed to include the initial thread 
and the thread itself if it is a joined thread. The one-way region OR of a thread 
is a pair of a join counter A and a nest region (a,/3), denoted by [A, (a,/?)] 
altogether. The join counter A of a thread is the number of the joined ancestors 
of the thread in the critical path from the initial thread. The nest region consists 
of two integers {a, (3) which mean a range of number space divided by each fork 
operation and concatenated by each join operation. This one-way region of a 
thread can be used as the thread identifier, and stored in access histories as an 
NR label of the thread for detecting races. The one-way history OH of a thread 
is an ordered list of the one-way roots of the thread which are represened by 
their one-way regions in an ascending order of their join counters. 

Figure E| shows an example of NR Labeling. The nest region of the initial 
thread is (1,50) just for readability, although it is initialized with {l,maxint) 
in general where maxint is the maximum integer that can be represented in a 
machine. This initial nest region is divided in two regions in the child threads of 
the initial thread into (1,25) and (26,50), and concatenated back to the initial 
region in the last thread. For an example of labeled thread, consider a thread t 
in the lower part of the figure whose one-way region is ORt = [4, (26,50)]. The 
join counter of t. At, is four, because t has four joined ancestors in the critical 
path from the initial thread: the threads whose one-way regions are [1,(1, 50)] 
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of the initial thread, [2, (26, 50)], [3, (26, 37)] or [3, (38, 50)], and [4, (26, 50)] of t. 
Therefore the one-way history of t, OHt, is the ordered list of two one-way roots 
of t: [1,(1, 50)] in the nesting level of zero, and [4,(26,50)] of t in the nesting 
level of one. And the one-way depth of t, is two, because t itself is the most 
recent one-way root of t and the nesting level of t is one. For the details of NR 
Labeling, refer to the original paper 0. 

To compare the concurrency relationship between any two threads, U and 
tj, we need ORi, ORj, and only one of OHi and OHj. Here we consider the 
case of using OHj. For this case, we first have to figure out the nearest one-way 
root of tj to ti, which is a thread t^ in OHj such that A^, is the smallest join 
counter in OHj that is greater than A^. Then, the thread ti happened before tj, 
if it satisfies Xi < Xx < Xj and the nest region (ai,Pi) overlaps with {ax,(3x) in 
OHj, or it satisfies Xi = Xj and {ai,j3i) overlaps with {aj,/3j). Here, we can use 
a binary search to find out tx in OHj , because an OH is a list ordered with the 
join counter as the search key. For example, consider the two threads, ta and R 
in the Figure® whose labels are ORa = [1, (26, 33)] and ORb = [3, (38, 50)] with 
its OHb- In this case, the nearest one-way root of R to ta is the thread tc whose 
one-way region ORc = [2, (26, 50)]. Therefore, ta and R are ordered each other. 
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0 NR-Init() 

1 (q,/3} := (l,maxint)-, 

2 A := 1; 

3 OH{l)~[X,{a,P)]- 

4 C:=i; 

5 End NR-Init 

0 NR-Fork() 

1 stride := {(Up — + l)/?7; 

2 a := + (7 — 1) X stride', 

3 if (i < U) then 

4 (3 := a + stride — 1; 

5 else (3 '■= Pp', 

6 endif 

7 A — Ap; 

8 for a 1 to ^ do 

9 OH{a) '.= OHp{a)' 

10 endfor 

11 C-Cp; 

12 A' := A; 

13 End NR-Fork 



0 NR-Join() 

1 A := A' + 1; 

2 if (077(e, (a,/3>) = (o,/l)) then 

3 077(5, A) := A; 

4 else 5 5 + Ij 

5 077(5) := [A,(q,/3>]; 

6 endif 

7 Ap := max{Ap, A}; 

8 End NR- Join 

0 NR-Ordered(077i) 

1 if (Ai < A) then 

2 OR := NR_BinarySearch{OH, 

3 if (ai < OR{P) and OR{a) < Pi) then 

4 return true; 

5 endif 

6 elseif (o; < P and a < Pi) then 

7 return true; 

8 endif 

9 return false; 

10 End NR-Ordered 



Fig. 5. NR Labeling Algorithms 



because the three join counters satisfy (1 < 2 < 3) and the nest region (26,33) 
overlaps with (26,50). 

We implemented NR Labeling with four algorithms shown in Figure S NR- 
Init(), NR-Fork(), NR-Join(), and NR-Ordered(). NR-0rdered(07?i) checks the 
logical concurrency between the current thread and another thread whose label 
is ORi- The other three algorithms generate concurrency information including 
NR label: NR-Init() for an initial thread, NR-Fork() for a thread started from 
fork operation, and NR-Join() for a thread started from join operation. In the 
algorithms, each data structure with a subscript p such as OHp means the corre- 
sponding data structure of the ancestor thread which forked the current thread. 
The counter variable X' is a mirror variable which is shared locally by the sib- 
lings of the current thread to help maintain A in the thread. And U denotes the 
upper bound of the current loop index 7. 



5 Empirical Comparison 

In this section, we empirically compare T-BD Labeling with NR Labeling using 
a set of OpenMP kernel programs with nested parallelism. For race detection 
protocol that uses the two labeling schemes, we chose Mellor-Crummey’s protocol 
0 which is efficient to monitor a parallel program with nested parallelism and no 
other inter-thread coordination. This protocol is efficient, because it requires only 
three entries in an access history of a shared variable. Figure El shows the Mellor- 
Crummey’s protocol implemented for this work, where we use AH{X) to denote 
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0 CheckRead(X, Label) 

1 if ^Ordered{AH{X,W)) then 

2 report a race; 

3 endif 

4 if AH (X,Rl) or 

LeftOf{AH{X,RL)) or 
Ordered{AH{X, Rl)) then 

5 AH {X, Rl) ■- Label- 

6 endif 

7 if AH(X, Rr) = 0 or 

-^LeftOf{AH{X,RR)) or 
Ordered{AH{X, Rr)) then 

8 AH{X,Rr) ■- Label-, 

9 endif 

10 End CheckRead 



0 CheckWrite(X, LafeeZ) 

1 if ^Ordered{ AH {X,W)) then 

2 report a race; 

3 endif 

4 if -^Ordered(AH{X, Rl)) or 

-iOrdered{AH(X, Rr)) then 

5 report a race; 

6 endif 

7 AH{X, W) -- Label-, 

8 End CheckWrite 



Fig. 6. The Mellor-Crummey Protocol 



the access history of shared variable X. AH{X) consists of three entries: one 
entry for write access, AH{X, W), and two entries for concurrent read accesses, 
AH{X, Rl) and AH{X,Rfi). AH{X, Rl) and AH{X,Rn) denote the leftmost 
and rightmost thread respectively that are distinguished using the left-of relation 
0. The left-of relation establishes a canonical ordering of two relative threads 
with respect to their thread indexes of a loop from which ancestors of the two 
threads are forked. This protocol therefore especially requires such a labeling 
scheme that can evaluate the left-of relation for any two threads. FigureQ shows 
two algorithms to evaluate the left-of relation for the two labeling schemes. 

The efficiency of on-the-fly race detection is classified into space and time 
complexity. Space complexity consists of the space to store access histories for 
all shared variables and the space to store thread labels of simultaneously active 
threads. Time complexity consists of the time to generate a label for every thread 
and to determine a race and maintain the access history in every access to a 
shared variable. Here, the time to determine a race includes the time to determine 
if the two threads of the conflicting accesses are ordered. Using the Mellor- 
Crummey protocol, both T-BD Labeling and NR Labeling show the same worst- 
case complexity of 0{V+ NT) in space to store constant-sized labels in all access 
histories and to store all the concurrency information of threads each of which 
size is 0{N), where V is the number of monitored shared variables, and N and T 
are the nesting depth and the maximum parallelism of the monitored program, 
respectively. Both T-BD Labeling and NR Labeling create each thread label in 
time of 0{N) in the worst-case, but they are different in time to determine the 
logical concurrency between any two threads, which are 0{N) and 0(log2 N) 
for T-BD Labeling and NR Labeling respectively. They are also different in the 
worst-case complexities of time to determine the left-of relation between any 
two threads, which are 0{N) and 0(1) for T-BD Labeling and NR Labeling 
respectively. This amount of difference in time, however, can be trivial, because 
N is typically as very small as nine or ten even in some serious large-scale 
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0 T-BD-LeftOf(fo, d) 

1 fc — 1; 

2 while L{k, T) ^ 0 do 

3 t_b := b mod L{k, T) + 1; 

4 if tJ) > L{k + 1, t) then 

5 return true; 

6 endif 

7 b--b/L{k,T)- 

8 fc := fe + 1; 

9 endwhile 

10 return false; 

11 End T-BD-LeftOf 



0 NR-LefOf(Oi7i) 

1 if P < Ui then 

2 return true; 

3 endif 

4 return false; 

5 End NR-LeftOf 



Fig. 7. The Left-Of Algorithms of T-BD and NR Labeling 



applications. We use Mellor-Crummey’s detection protocol because, as shown 
in Figure El the protocol requires a constant time to detect races in each access 
except these two kinds of efficiencies from labeling scheme: the times to determine 
the logical concurrency and the left-of relation between any pair of threads. 

As of writing this paper, we do not have any published OpenMP benchmark 
program that is appropriate to evaluate labeling schemes for detecting races in 
OpenMP programs. Omni-OpenMP team jSj developed some OpenMP versions 
of NAS benchmark programs, but they are neither appropriate to evaluate on- 
the-fly race detection tools nor have nested parallelism considered in this work. 
To measure the actual efficiencies of T-BD Labeling and NR Labeling, we use a 
set of OpenMP kernel programs which are compiled by Omni OpenMP Compiler 
Version 1.2.0 jH|, and run on Intel Pentium-Ill PC under Linux Red Hat Version 
6.2. We have two sets of OpenMP-Fortran kernel programs: one set of programs 
with no nested loops but one-way nested loops and the other set of programs 
with no nested loops but two-way nested loops. Each kernel program executes 
ten accesses to one shared variable in every thread. The characteristics of our 
kernel programs is shown in the corresponding columns of Table E and 0 the 
nesting depth N = {3, 4, 5}, the number of threads forked in every parallel loop 
t = {10, 15,20}, the names of programs, and the original execution time that is 
measured in unmonitored phase. 

We instrumented the kernel programs for the two kinds of algorithms: the 
labeling schemes to generate concurrency information on every thread in an exe- 
cution of the program, and the race detection protocol on every access to detect 
races occurred in the execution instance. These algorithms are implemented as 
a set of library routines written in C, and linked to the OpenMP-Fortran ker- 
nel programs. We put the corresponding labeling functions immediately after 
each ‘C$0MP END PARALLEL DO’ directive and each DO statement which follows 
‘C$DMP PARALLEL DO’ directive. Each protocol function is located immediately 
after every statement that has the corresponding type of accesses to read or write 
to a shared variable. The data structures for the algorithms are instrumented 
in two types: a globally-shared access history for the protocol functions at the 
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Table 1. The Result using One-way Nested Kernel Programs (unit: second) 



Kernel 


Labeling 


Protocol 


Total 


N 


t 


Version 


Original 


T-BD 


NR 


T-BD 


NR 


T-BD 


NR 


3 


10 


3 P 10 


0.01 


0.00 


0.00 


0.04 


0.01 


0.06 


0.02 


15 


3 P 1 S 


0.01 


0.01 


0.01 


0.13 


0.03 


0.15 


0.05 


20 


3 P 20 


0.01 


0.02 


0.01 


0.31 


0.09 


0.34 


0.11 


4 


10 


4 P 10 


0.01 


0.04 


0.02 


0.44 


0.11 


0.49 


0.14 


15 


4 P 1 S 


0.03 


0.15 


0.10 


2.12 


0.55 


2.30 


0.68 


20 


4 P 20 


0.07 


0.46 


0.31 


6.48 


1.69 


7.01 


2.07 


5 


10 


5 P 10 


0.06 


0.83 


0.20 


4.85 


1.17 


5.74 


1.43 


15 


sPis 


0.31 


5.98 


1.53 


34.57 


8.22 


40.86 


10.06 


20 


5 P 20 


1.13 


24.53 


6.36 


140.95 


33.45 


166.61 


40.94 



Table 2. The Result using Two-way Nested Kernel Programs (unit: second) 



Kernel 


Labeling 


Protocol 


Total 


N 


t 


Version 


Original 


T-BD 


NR 


T-BD 


NR 


T-BD 


NR 


3 


10 


3 P 10 


0.01 


0.03 


0.02 


0.33 


0.08 


0.37 


0.11 


15 


3 P 15 


0.02 


0.07 


0.05 


1.08 


0.28 


1.17 


0.35 


20 


3 P 20 


0.03 


0.19 


0.11 


2.49 


0.70 


2.71 


0.84 


4 


10 


4 P 10 


0.07 


0.48 


0.33 


7.23 


1.84 


7.78 


2.24 


15 


4 P 15 


0.32 


2.34 


1.60 


34.87 


8.47 


37.53 


10.39 


20 


4 P 20 


0.89 


7.35 


5.00 


107.57 


26.11 


115.81 


32.00 


5 


10 


5 P 10 


1.45 


9.53 


6.45 


161.42 


35.57 


172.40 


43.47 


15 


5 P 15 


9.02 


71.02 


48.04 


1170.14 


256.03 


1250.18 


313.09 


20 


5 P 20 


35.22 


295.39 


199.49 


4818.61 


1050.15 


5149.22 


1284.86 



beginning of program, and the private variables for labeling functions which is 
instrumented as a PRIVATE clause in every ‘C$DMP END PARALLEL DO’ directive. 

To compare the execution times of the two labeling schemes, we monitored 
the programs sequentially, because the comparison requires to measure the net 
execution time of labeling algorithm in the monitored execution. For the sequen- 
tial monitoring, we set an OpenMP environment variable to execute the kernel 
program in one thread. Using the time command, each kernel program is mea- 
sured in three phases: the unmonitored phase to measure the original execution 
time of the program, the partially-monitored phase to measure the time to ex- 
ecute only the labeling functions without instrumenting the protocol functions, 
and the fully-monitored phase to measure the total time to detect races. Then we 
calculated the protocol time by subtracting the partially-monitored time from 
the fully-monitored time. 

The empirical results in TableOIandOshow that NR Labeling is more efficient 
than T-BD Labeling by at least 1.5 times in generating labels, by at least 3.5 
times in using the labels to detect races, and by at least 3.0 times in total 
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monitoring with the two sets of kernel programs. We argue that the differences 
reffect the time complexities of algorithms. The difference in the labeling times 
stems from the number of for-loops which consume 0{N) time in the worst-case 
in each labeling function for a thread started from fork operation: two loops in 
T-BD-Fork(), and one loop in NR-Fork(). And the difference in protocol time 
stems from the while-loops which compares the logical concurrency and the 
left-of relation between two threads. T-BD-Ordered() has a loop that consumes 
0{N) time, but NR-Ordered() has a loop in its binary search that consumes only 
0(log2 N), in the worst-case. And, T-BD-LeftOf() also has a loop that consumes 
0{N) time in the worst-case, but NR-LeftOf() consumes only a constant time. 

6 Conclusion 

To compare the time efficiency of two scalable labeling schemes, BD Labeling 
and NR Labeling, we modified BD Labeling to be more efficient by making 
it generate only one label for each thread, and call it Thread-based- BD (T-BD) 
Labeling. Then, we empirically compared the actual efficiencies of these two scal- 
able schemes using a set of OpenMP kernel programs with nested parallelism. 
We monitored the kernel benchmark programs sequentially, because the com- 
parison requires to measure the net execution time of each labeling scheme in 
a monitored execution. The empirical results show that NR Labeling is more 
efficient than T-BD Labeling by at least 1.5 times in generating labels, by at 
least 3.5 times in using the labels to detect races, and by at least 3.0 times in 
total monitoring with the kernel programs. 
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Abstract. Debugging nondeterministic parallel programs is accepted as 
one of the harder problems of software engineering. One source of nonde- 
terminsm are semaphores used to establish and control critical sections. 
As some threads compete for a semaphore, the point of time by which 
a specific thread locks a specific semaphore is not determined and may 
change during subsequent executions. A technique for debugging pro- 
grams containing such race conditions is event manipulation, which al- 
lows the user to investigate the effects of different ordering in accesses to 
semaphores during subsequent re-executions. This allows to detect hid- 
den errors, that may otherwise occur only sporadically. The technique 
described in this paper targets at OpenMP programs, and is therefore 
the first approach to perform event manipulation on shared memory ap- 
plications. 



1 Introduction 

Shared memory parallel programming is an important paradigm for application 
development on high performance computing systems. Firstly, with OpenMP ^ 
there exists a standard allowing to write shared memory programs which are 
portable across most existing parallel architectures. Secondly, the growing spread 
of low-cost computing clusters constructed of commodity SMP nodes reaches 
more and more customers. Thirdly, even large-scale massively parallel machines 
like ASCI WhitcQ incorporate shared memory programming by pushing a mixed 
mode between MPI and OpenMP. Due to these reasons, it seems justified and 
necessary to provide powerful shared memory program development tools for 
software engineers. 

As commercial available OpenMP debugging tools offer no functionality to 
deal with nondeterminism, our intention was to develop a method to handle 
nondeterministic OpenMP programs. In this paper, we describe a technique for 
supporting the user during debugging of nondeterministic OpenMP programs. 
Nondeterminism is observed, whenever subsequent program executions reveal 
different results although the same input is provided [Zj. At present, we consider 

^ http : //www. llnl . gov/asci/platf orms/white/ 

R. Eigenmann and M.J. Voss (Eds.): WOMPAT 2001, LNCS 2104, pp. 81-^^ 2001. 
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the order of accesses to shared data spaces, which are encapsulated in OpenMP 
lock and unlock operations. During an initial program execution, the observed 
access order is stored in tracefiles. Based on the stored data, an event graph 
is constructed, and the user can choose a different access order by applying 
event manipulation. Of course, only changes that are actually possible can be 
applied. The effects of these changes and their inffuence on a program’s results 
can afterwards be evaluated by initiating an artificial program re-execution. 

The next section offers some basic definitions on nondeterminism in par- 
allel programs. Afterwards, the basics of the event manipulation strategy are 
introduced, before describing each step of this approach with an example of a 
nondeterministic OpenMP program. 



2 Nondeterminism and Race Conditions 

In parallel programs, nondeterministic program behavior is introduced either 
through special functions (e.g. the random number generator) or as a side effect 
of process interaction P|. For process interaction in shared memory programs, 
two types of race conditions can be distinguished Uni: 

— data races: if several threads can access shared memory and the accesses 
depend on the relative speed of the threads, different results can occur in 
successive program runs. 

— synchronization races: when several threads perform synchronization oper- 
ations on the same object, the relative speed of the threads may lead to 
different access ordering. 

Data races are often not intended by the programmer, and therefore represent 
errors per se that can be removed immediately m- On contrary, synchroniza- 
tion races occur due to the user’s placement of synchronization operations. An 
example is offered by OpenMP with the possibility of allocating semaphores 
and using corresponding lock and unlock operations. Figure ^ shows a synchro- 
nization race between two threads (Tl, T2). The brackets indicate the lock and 
unlock operations for a selected semaphore operation, which is used to determine 
the value of variable A. As shown in Figure QJ two possible execution orders can 
occur, with different results in variable A depending on which process is allowed 
to perform its lock operation first. 

Due to such race conditions, several difficulties occur during debugging 

— The irreproducibility effect: given a nondeterministic program it cannot be 
guaranteed, that an observed execution can be reproduced during debugging. 

— The completeness problem: obtaining all possible results of a nondeterminis- 
tic program may be impossible, no matter how many executions are initiated. 

— The probe effect: due to the overhead of debugging, a nondeterministic pro- 
gram may yield different results compared to an execution without debugger. 
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Execution 1 
T1 T2 





Fig. 1. Different execution sequence due to synchronization race. 



The irreproducibility effect is addressed with so-called ” record&replay” tech- 
niques, which operate in two phases 0. After instrumenting a program, critical 
events are recorded during an initial executions. The recorded data can after- 
wards be used during arbitrary replay phases to establish the same order at 
the critical events, and thus provides an equivalent execution |S| . Record&replay 
approaches for shared memory parallel programs are discussed in j 1 1 118) . 

In contrast to the irreproducibility effect, there are currently no solutions 
addressing the completeness problem and only a few approaches addressing the 
probe effect. In both cases, the main problem is to address the changes in access 
ordering. This problem is approached by our event manipulation approach as 
described below. 

3 Event Manipulation Strategy 

In order to describe our event manipulation approach, we should briefly offer 
some definitions concerning the underlying event graph model 

Definition 1. An event graph is a directed graph G = (A, — where E is the 
non-empty set of events ofG, while — >■ is the ’’happened before” -relation con- 

necting events, such that — >■ means, that there is an edge from event to 
event in G with the ’’tail” at event and the ’’head” at event e^. 

The events in such a graph are state changes occurring during program exe- 
cution. An event definition is given in jOj: 

Definition 2. An event e^ is defined as an action without duration that takes 
place at a specific point in time i and changes the state of a process/thread p. 

The relation connecting these events is the so-called happened before relation, 
which is often used to establish a connection between corresponding events in 
concurrent systems |S|: 
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Definition 3. The ’’happened before” relation denoted as on a set of events 
in G is the smallest transitive, irrefiexive relation satisfying the following two 
conditions for arbitrary events and ; 

1. If 6p occurs before on the same thread (p =q), then — >■ e^. 

2. If Cp is the source of a communication or synchronization operation on thread 
p, and is the corresponding target operation on thread q, then Cp ^ 

The events in such a graph are state changes occurring during program ex- 
ecution P]. At present, we are only interested in events generated by lock and 
unlock operations, which are responsible for synchronization races as described 
above. 

The data for constructing an event graph model of a specific program run 
is generated during an initial execution. For that reason, a program has to be 
instrumented in order to observe the program’s execution. Afterwards, the user 
can can perform nondeterminism analysis by performing the following steps 0 : 

1. Selection of an arbitrary nondeterministic event 

2. Evaluation of race candidates corresponding to the selected event 

3. Event manipulation by exchanging the selected event with another candidate 

4. Artificial replay to enforce the manipulated execution 

4 Example 

4.1 OpenMP Source Code 

We describe this approach by means of a short OpenMP program example. In 
this example, computation is done by simply increasing an integer variable by 1 
inside the critical section. 

#include <omp.h> 

#include <monitor.h> 

#define SET(s,v) { omp_set_lock(&s) ; v++; omp_unset_lock(&s) ; } 

omp_lock_t si; 
omp_lock_t s2; 
omp_lock_t s3; 

int vl = 0, v2 = 0; 

void main (int argc, char **argv) { 
omp_init_lock(&sl) ; 
omp_init_lock(&s2) ; 
omp_init_lock(&s3) ; 
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Step 1: event selection point-of-exchange set 




Fig. 2. Example event graph for nondeterminism analysis 



> 



#pragma omp parallel 

{ 



if (omp_get_thread_num() == 0) SET(s2,v2); 
else if (omp_get_thread_num() == 1) { 
SET(sl,vl); SET(s2,v2); 

SET(sl,vl) ; 

} 

else if (omp_get_thread_num() == 2) SET(sl,vl); 
else if (omp_get_thread_num() == 3) { 
SET(sl,vl); SET(s2,v2); 

} 



else if (omp_get_thread_num() == 4) { 
SET(sl,vl); SET(s2,v2); 
SET(sl,vl); SET(s2,v2); 



} 



omp_destroy_lock(&sl) 

omp_destroy_lock(&s2) 

omp_destroy_lock(&s3) 



A concrete event graph of an observed execution is shown in Figure |3 As 
events we have chosen the ”omp_set_lock” and ”omp_unsetdock” operations. For 
example, ”S1” indicates an ”omp_setdock” operation on a arbitrary semaphore 
’’semaphore 1”, while ”U1” describes the corresponding ” omp _unset Jock” oper- 
ation p. 
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4.2 Evaluation of Race Candidates 

For this example event graph, we apply the strategy defined above. According to 
step 1, we choose an arbitrary event for a nondeterministic lock operation, e.g. 
event {SI, 9, 3), which means a set-event concerning semaphore 1 on thread 3 at 
logical clock 9. Assuming, that we want to investigate the program’s behavior 
corresponding to the possible choices at this event, we proceed with step 2 of 
our strategy. 

The main idea of step 2 is to identify all the possibilities, that may have 
occurred instead of event (S'!, 9, 3). For our example this means, that we have to 
identify all the lock operations, that may have occurred instead of our selected 
event. This means, that all events occurring concurrently to the selected event 
have to be considered. 

The corresponding set of critical events is called race candidates (rc), which 
can be defined as follows: 

rciel) = {e^} U {min(e:^) | -.(e^ e^VO <q<n,p^q} 

For our example of Figure |3 the following set of race candidates concerning 
event (S'!, 9, 3) is obtained: 

rc((.51, 9, 3)) = {(51,9,3)} U {(51, 15,2), (51, 17, 1), (51, 19,4)} 

Besides that, we need to verify whether there exist other operations, which 
interfere with the order of accesses to SI. In concrete, we need to identify those 
events on other semaphores, for which the following condition holds: 

ej, ^ A ^ 

For our example this means, that we must eliminate element (51, 19,4) from 
the set of race candidates. This event cannot occur instead of event (51,9,3) 
due to its dependency on semaphore S2, which is accessed by thread 3 at event 

(52. 11.3) . The resulting set of race candidates is therefore: 

rc((51, 9, 3)) = {(51, 9, 3), (51, 15, 2), (51, 17, 1)} 

4.3 Event Manipulation 

With the set of race candidates, we can investigate the following question: 

What would have happened, if the nondeterministic choices would have been 
different from what has been observed? 

For our example, we want to investigate, what would have happened, if any 
other member of the set of race candidates would have occurred instead of event 

(51.9.3) . Together with this other member, the point-of-exchange (poe) can be 
defined as follows: 



poe = {Cp, el G rc) 
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thread 0 
thread I 
thread 2 
thread 3 
thread 4 

logicalclock 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

Sl/Ul ... sel/unset semaphore 1 S2/U2 ... set/unset semaphore 2 




Fig. 3. Event graph after event manipulation and artificial replay 



Therefore, we arbitrarily select event (S'!, 15, 2) to be exchanged with event 
(S'!, 9, 3) by specifying the following point-of-exchange: 

poe = {(51,9,3),(^l,15,2)} 

Step 1 to 3 are shown in Figure El with annotations. With the poe defined, 
we can proceed to step 4. During step 4, the program is re-executed and the 
changes as defined by the poe are enforced. However, since the consequences of 
performing the event manipulation are unknown, we need to distinguish three 
phases during the artificial replay: 

— Before the poe, the program is executed as previously observed. This resem- 
bles the behavior of traditional record&replay mechanisms. 

— At the poe, the order of events is enforced as defined by the user. 

— After the poe, the program needs to be executed without control. This re- 
sembles a program run without a replay mechanism. 



4.4 Cut Placement 

In order to identify the place, where our artificial replay mechanism needs to 
switch from the replay mode to an execution without constraints, we identify a 
so-called cut, which can be defined as follows: 

cut{ep) = poe U {min(ej) | — >■ e^VO < r < n,r p ^ q} 

The cut for our example of Figure Q is shown in Figure El It consists of the 
following events: 



cutiiSl, 9, 3)) = {(^1, 9, 3), (51, 15, 2)} U {(52, 13, 4), (51, 17, 1)} 
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The shaded region indicates the area, where the program’s re-execution is 
carried out without the replay mechanism. No semaphore transitions are viewed 
in this region because the consequences of the event manipulation cannot be 
predicted. This is the reason why no transition from event (C/2,8, 1) on thread 
1 to the next set event on semaphore 2 is drawn. 

The execution, that is revealed by the event manipulation, delivers the results 
with the revised event order. Therefore, it is possible that hidden errors or errors 
occurring only sporadically are detected. 

5 Conclusions and Future Work 

This paper describes the event manipulation approach for shared memory pro- 
grams. At present, we support only the OpenMP set/unset operations, which 
are sufficient to show the usage of our technique. The technique has already been 
implemented in a first prototype tool. The tool is able to instrument arbitrary 
OpenMP programs (by using a C-preprocessor macros) and to generate corre- 
sponding tracefiles. Based on the tracefiles the four steps describe in Section 0 
can be performed by the user. The missing part of this tool is a graphical user 
interface (GUI), which is currently being developed. A first view on this tool will 
be presented at the conference. 

As a first extension we need to add other operations, e.g. the OpenMP test 
operation, which inspects the availability of a semaphore. At this point it is 
certainly necessary to apply the prototype tool to some real-world applications 
in order to identify remaining problems due to nondeterminism. In this con- 
text it is even imaginable, that the complete record&replay strategy as well as 
the event manipulation technique is included in an integrated program devel- 
opment environment. This would allow even inexperienced users to investigate 
nondeterministic parallel programs. 

Another extension of our initial approach is to perform the event manipula- 
tion operation automatically. As shown in our example, we can choose several 
combinations of race candidates for the point-of-exchange. Consequently, there 
is more than one execution that can be derived from an initial program run with 
our event manipulation approach. If we are able to implement our strategy au- 
tomatically, we will also solve the incompleteness problem, as well as problems 
with the probe effect due to event re-ordering. 

Acknowledgements. Contributions to this work have been made by many 
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Abstract. POSIX Threads and OpenMP were used to implement parallelism in 
the nuclear reactor transient analysis code PARCS on multiprocessor SUN and 
SGI workstations. The achievable parallel performance for practical applications 
is compared for each of the code modules using POSIX threads and OpenMP. A 
detailed analysis of the cache misses was performed on the SGI to explain the 
observed performance. Considering the effort required for implementation, the 
directive based standard OpenMP appears to be the preferred choice for parallel 
programming on a shared memory address machine. 



1 Introduction 

The Purdue Advanced Reactor Core Simulator (PARCS) code[l] was developed by 
the United States Regulatory Commission for the safety analysis of nuclear reactors 
operating in the United States. The code solves the time-dependent Boltzmann trans- 
port equation for the neutron flux distribution in the nuclear reactor core. The code is 
written in FORTRAN and has been executed on a wide variety of platforms and oper- 
ating systems. The code has been extensively benchmarked and results are docu- 
mented in numerous reports [2] for a variety of Light Water Reactor (LWR) applica- 
tions. 

The computational burden in PARCS for solving practical reactor problems can be 
considerable even on the most efficient workstations. The execution time depends on 
the type of reactor transient simulated, but typically the code runs about an order of 
magnitude slower than real time. One of the primary objectives of the work performed 
here was to investigate multiprocessing as a means to reduce the computational burden 
with the ultimate goal of real time nuclear reactor simulation. The target platform for 
this work is a typical multithreaded workstation and therefore threading was the pre- 
ferred parallel model. This paper will describe experience with using POSIX 
Thread[3] (Pthreads) and OpenMP[4] in achieving multiprocessing with the PARCS 
code on SUN and SGI workstations. 
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The paper will first briefly discuss the physics and computational methods used in 
the PARCS code and then describe the code structure and approach used to achieve 
multiprocessing. Sections 3 and 4 will discuss the implementation of Pthreads and 
OpenMP in the PARCS code and analysis of the parallel performance for a practical 
reactor transient, respectively. 



2 Physics and Computational Method 

The solution of the Boltzmann transport equation for the neutron flux distribution in a 
nuclear reactor requires consideration of seven independent variables: neutron energy, 
three spatial dimensions, two angular dimensions, and time. Over the last several 
years, several approximations have proven successful in reducing the computational 
time and yet preserving the accuracy of flux predictions. For typical Light Water Re- 
actor problems, two neutron energy groups and a first order angular flux approxima- 
tion have proven to be effective. The neutron flux and fuel pin powers in the reactor 
can be predicted to within a few percent of measured data. 

At any point in time, the resulting form of the equations is an elliptic PDF which 
can be discretized using conventional methods. The physical core model contains 
about 200 fuel assemblies, each of which contains about 250 fuel pins. Because the 
mean free path of the neutron is about the same as the distance between fuel pins, the 
resulting three-dimensional spatial mesh could be on the order of a million. However, 
innovative “multi-level” type methods have been developed over the years in which 
the pin by pin flux is solved only at the local level and the global problem is solved 
with mesh spacing the same size as the fuel assembly. The resulting local/global itera- 
tion introduces some natural parallelism since the local problems (NODAL module) 
can all be solved simultaneously. Innovative domain decomposition methods were 
also introduced to solve the global (CMFD module) problem. The Coarse Mesh Finite 
Difference (CMFD) solution is based on a Krylov semi-iterative method, Bi- 
Conjugate Gradient Stabilized (BICGSTAB), accelerated with a preconditioning 
scheme based on a blockwise incomplete LU factorization. Parallelism is achieved 
using an incomplete domain decomposition preconditioning method [1]. In addition to 
the Boltzmann transport equation, the single phase fluid dynamics and one- 
dimensional heat conduction equations are solved in PARCS to provide the tempera- 
ture/fluid field in the core. This is performed in the T/H module of PARCS. The feed- 
back to the neutron field equations is provided through the coefficients of the Boltz- 
mann equation or “cross sections” and is performed in the XSEC module of PARCS. 

Parallelism is achieved in PARCS by first dividing the code into the four basic 
modules which solve distinct aspects of the coupled neutron and temperature/fluid 
field calculation: CMFD, Nodal, T/H, and XSEC. The last three of these modules are 
easily parallelizable since all calculations are performed on a node by node basis and 
can be evenly assigned to separate threads. Parallelism of the CMFD calculation is 
more difficult but can be achieved by domain decomposition and incomplete LU pre- 
conditioner as noted above. In the analysis reported here domain decomposition is 
applied by dividing the reactor core into several axial planes. In the model used here 
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the reactor core is divided into 18 axial planes. In some cases, load imbalance will 
contribute to the loss of parallel efficiency since the number of axial planes will not 
always be an even multiple of the number of processors. 



3 Implementation of POSIX Threads and OpenMP 

The target architecture for multiprocessing with PARCS was a shared memory address 
space machine since much of the parallelism was expected to be very fine grain. In 
general, the overhead for message passing parallelism is most suitable for applications 
which are predominantly coarse grain. In general, Pthreads and OpenMP are the stan- 
dard techniques available for multithreaded programming. In the work here, both 
Pthreads and OpenMP were implemented on SUN and SGI workstations. Pthreads 
required considerably more effort than OpenMP to implement within the existing 
PARCS code. Considerable cautions had to be used with Pthreads to avoid such prob- 
lems as race conditions. Conversely, the implementation of parallelism with OpenMP 
was considerably much simpler than Pthreads. It was simply a matter of inserting 
OpenMP directives for parallel regions and specifying which variables were shared 
and/or private. 



3.1 POSIX Threads 

Because Pthreads does not support an interface to FORTRAN, a mixed language pro- 
gramming was required in PARCS. A threads library was developed which is accessi- 
ble to both FORTRAN and C sections of the code. It consists of a minimal set of 
thread functions for FORTRAN which were implemented by FORTRAN-to-C “wrap- 
pers”. This library was named nuc_threads and provided the following four functions: 

nuc_init(*ncpu) 

This routine initializes mutex and condition variables. The initialization was called 
before any other nuc_threads subroutine and after defining the number of threads to be 
used. 

nuc_frk(*func_name,*nuc_arg,*arg) 

This routine creates the Pthreads. The arguments are the name of the function to be 
threaded, the number of arguments for the to-be-threaded function, as well as the 
actual arguments. The nuc_frk uses the Pthreads fucntion pthread_create. 

nuc_bar(*iam) 

This routine is used for synchronization. The nuc_bar uses a counter and the Pthreads 
routines pthread_cond_wait and pthread_cond_broadcast to implement synchroniza- 
tion. 
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nuc_gsum(*iam,*A,*globsum) 

This routine is used to get a global sum of an array updated by each thread. 

One approach to multithread a program is to fork each subroutine which contains a 
parallel region. The benefit of this approach is that it minimizes the global vs. local 
variable concerns. However, it introduces the possibility of performance loss due to 
the overhead of the nuc_frk function. This is particularly true if the subroutine is in a 
loop. It is very possible that the overhead associated with the use of nuc_frk could 
override any performance improvement by parallel computations. To avoid this prob- 
lem and minimize the overhead in the application to PARCS, the main program was 
forked at the beginning of execution. 



3.2 OpenMP 

OpenMP is a standardized set of directives that can be used with FORTRAN, C, and 
C-H- for programming on a shared address space machine. Compared with the 
Pthreads, the implementation is much easier due to the directive based nature of 
OpenMP. The nature of the parallelism in the OpenMP version PARCS (e.g. domain 
decomposition) was essentially identical to the parallelism in the Pthreads version 
implementation. However, there are some differences in the implementation such as 
threads being forked in each subroutine rather than in the main program as in the 
Pthreads implementation. 



4 Application Results 

The performance of Pthreads and OpenMP in PARCS was analyzed using a practical 
nuclear reactor transient benchmark problem. Prior to performing the practical tran- 
sient analysis, however, it was helpful to first examine the performance of a simple 
FORTRAN program performing a matrix vector product. This analysis was used to 
gain some insight about the performance that was to be expected for the practical 
implementation in PARCS. The specifications for the two UNIX platforms used in 
this analysis are shown in Table 1. 



4.1 Matrix- Vector Multiplication 

The subroutine axb.f from the PARCS code which performs matrix-vector multiplica- 
tion was parallelized using Pthreads and OpenMP. The size of vector used was 162KB 
(2x34x17x18x8) which is the same size as that of the benchmark problem presented in 
the next section. The timing results for the matrix vector application are summarized 
in Table 2. F90 version 6.1 is the latest available version of the SUN FORTRAN 
compiler. As shown in the results, the SUN FORTRAN compiler shows an unreason- 
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able increase in execution time when using OPENMP for a single thread compared to 
unthreaded serial execution. This time increase appears to be attributable to the f90 
dynamic memory allocation (DMA) module, since the threaded and serial time are 
similar when DMA is not used. However, DMA is implemented throughout PARCS 
and therefore must be utilized for multiprocessing. In constrast to the OpenMP per- 
formance, the execution time for the Pthreads implementation is very good on the 
SUN. As shown in the Table, the CPU time for serial and single thread execution are 
nearly identical and a near linear speedup (1.95) was observed for execution with 2 
threads. The application of four and eight threads for the SUN platform was not prac- 
tical since the SUN machines used here had only 2 CPUs. 



Table 1. Specification of Machines 



Platform 


SUN ULTRA-80 


SGI ORIGIN 2000 


Number of CPUs 


2 


32 


CPU Type 


ULTRA SPARC II 
450 MHz 


MIPS RIOOOO 
250 MHz 
4-way superscalar 


LI Cache 


16 KB D-cache 

16 KB I-cache 

Cache Line Size : 32bytes 


32 KB D-cache 

32 KB I-cache 

Cache Line Size : 32bytes 


L2 Cache 


4MB 


4MB per CPU 

Cache Line Size : 128bytes 


Main Memory 


1GB 


16GB 


Compiler 


SUN Workshop 6 
-LORTRAN 90 6.1 


MlPSpro Compiler 7.2.1 
- LORTRAN 90 



Table 2. Summary of Execution Time for MatVec Routine 



Machine 


Serial 


OpenMP 


Pthreads 






1* 


2 


4 


8 


1 


2 


4 


8 


SUN 


3.76** 


23.43 


13.26 


- 


- 


3.71 


1.93 


- 


- 






(0.16)*** 


(0.28) 


- 


- 


(1-02) 


(1-95) 


- 


- 


SGI 


1.73 


1.73 


0.92 


0.52 


0.37 


1.72 


1.80 


1.91 


1.96 






(1.00) 


(1.89) 


(3.30) 


(4.72) 


(1-01) 


(0.96) 


(0.91) 


(0.88) 



* : Number of threads 

** : Execution time(milli-seconds) 

*** : Speedup 



On the SGI platform OpenMP shows good parallel performance, however parallel 
speedup could not be achieved with Pthreads since the scheduler assigned all threads 
to the same processor. The Pthreads standard is dependent on the vendor specific 
scheduler to assign threads to processors. Because this study was to compare the per- 
formance of Pthreads and OpenMP, no attempt was made to implement the SGI native 
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threads library SPROC. It was observed that the difficulty in assigning threads to 
processors occurred only for mixed-language programming, i.e., this problem disap- 
pears for the C-language-only program using Pthreads. 

For 2 threads execution on the SGI with OpenMP, a speedup of 1.89 was observed, 
which is comparable to the 1.95 speedup achieved on the SUN with Pthreads. There 
was a noticeable reduction in the parallel efficiency for this problem on larger num- 
bers of processors (ie. 4 and 8 PEs). This is primarily because of load imbalance issues 
since the domain decomposition was restricted to the 18 axial planes which could not 
be evenly distributed to 4 and 8 PEs. 



4.2 PARCS Application 

The practical application with PARCS was to an international reactor transient 
benchmark problem[5]. An important transient in a nuclear reactor is the ejection of a 
control assembly from an initially critical core at hot, zero power conditions. There is 
a significant redistribution of power in the core and the possibility of a local energy 
deposition that could result in the melting of a fuel rod. The focus of the results here is 
on the neutron flux solution, but the transient also requires solution of the tempera- 
ture/fluid field equations. The multithreaded solution of the OECD benchmark was 
performed with PARCS and as shown in Figure 1 below, all threaded versions provide 
the same result as the serial execution. 
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Fig. 1. Core Power versus Time for Serial and Multithreaded Code Execution 



In Table 3 below, the execution time and parallel performance on the SUN machine 
are summarized for the Pthreads and OpenMP versions of the code. The number of 
updates in the table indicates the number of times each module must be executed in 
order to solve the benchmark problem. Similar to the results observed for the matrix- 
vector multiplication shown in Table 2, the OpenMP performance on the SUN is poor. 
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whereas the Pthreads performance is reasonably good. The speedup achieved with the 
parallel Pthreads version is considerably different for each module. For example, the 
T/H and XSEC module show superlinear speedup while the CMFD and NODAL 
module speedup is only 1.77 and 1.78 respectively. As noted earlier, the T/H and 
XSEC modules are inherently parallel and therefore superlinear speedup is possible. 
Conversely, the solution of the elliptic PDE in the CMFD module uses domain de- 
composition methods and as indicated in the Table, there is a slight increase in the 
number of linear solutions required because of additional iterations. Also, the number 
of updates in the NODAL module increased due to the increased number of calls to 
the CMED module. The increase of the number of updates for 2-threads case in 
CMED and NODAL modules are small, just 2.5% and 6.5% respectively. Considering 
the naturally parallelizable nature of the NODAL module, the speedup of 1.78 is lower 
than expected even with the increase in the number of updates. A more detailed analy- 
sis was performed to understand these results from the standpoint of cache utilization 
and will be presented in the next section. 



Table 3. Summary of Parallel Execution Times on the SUN 



Module 


Serial 


OpenMP 


Pthreads 


1* 


2 


Speedup 


1 


2 


Speedup 


Time 


CMED 


36.7 


107.0 


59.0 


0.62 


32.1 


20.8 


1.77 


(sec) 


Nodal 


11.5 


22.8 


17.4 


0.66 


11.3 


6.4 


1.78 




T/H 


29.6 


32.6 


19.1 


1.56 


27.9 


14.5 


2.04 




Xsec 


7.6 


47.5 


24.3 


0.31 


7.1 


3.7 


2.04 




Total 


85.4 


209.8 


119.8 


0.71 


78.5 


45.5 


1.88 


# of 


CMED 


445 


445 


456 


- 


445 


456 


- 


Updates 


Nodal 


31 


31 


33 


- 


31 


33 


- 




T/H 


216 


216 


216 


- 


216 


216 


- 




Xsec 


225 


225 


226 


- 


225 


226 


- 



* : Number of threads 



The performance of PARCS on the SGI is summarized in Table 4. As indicated, the 
serial execution time on the SGI is reduced by almost 30% compared to the SUN. 
Similar to the Pthreads performance for the matrix vector multiplication shown in 
Table 2, it was not possible to schedule threads on the SGI and therefore no speedup 
was observed on the SGI with Pthreads. Conversely, the OpenMP performance on the 
SGI is very good. The speedup achieved with 2 threads (1.85) using OpenMP on the 
SGI is comparable to the 2-thread performance achieved with Pthreads on the SUN 
(1.88). In particular, the T/H and XSEC modules again show a superlinear speedup, 
whereas the CMED and NODAL modules show a relatively low speedup. As will be 
discussed in the next section, this can be attributed in part to the efficiency of the 
cache utilization in each module. The speedups on 4 and 8 processors is poor, primar- 
ily because of problems balancing the load. As noted earlier, domain decomposition 
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was implemented on a plane-wise basis and the 18 planes in the model are not evenly 
assignable to 4 and 8 processors. 



Table 4. Summary of Execution Time on SGI 



Module 


Serial 


OpenMP 


Pthreads 


1* 


2 


5 ** 


4 


S 


8 


S 


1 


2 


S 


Time 


CMFD 




19.3 


12.1 


1.63 


8.93 


2.21 


8.85 


2.23 


19.4 


20.9 


0.95 


(sec) 


Nodal 


9.0 


9.2 


5.8 


1.55 


3.56 


2.53 


2.87 


3.14 


9.2 


9.7 


0.92 




T/H 


26.6 


25.3 


12.3 


2.17 


8.92 


2.99 


7.14 


3.73 


25.2 


25.5 


1.05 




Xsec 


4.8 


4.4 


2.4 


2.01 


1.37 


3.53 


1.11 


4.35 


4.8 


5.0 


0.97 




Total 


60.2 


58.1 


32.6 


1.85 


22.8 


2.64 


20.0 


3.02 


58.6 


61.1 


0.99 


#of 


CMFD 


445 


445 


456 


- 


497 


- 


565 


- 


445 


456 


- 


Updates 


Nodal 


31 


31 


33 


- 


38 


- 


39 


- 


31 


33 


- 




T/H 


216 


216 


216 


- 


216 


- 


217 


- 


216 


216 


- 




Xsec 


225 


225 


226 


- 


228 


- 


227 


- 


225 


226 


- 



* : Number of threads 

** : Speedup 

*** : Execution time (seconds) 



4.3 Cache Performance Analysis 

The processors on both of the platforms used in the analysis here are relatively fast, 
which suggests that data access times will be particularly important to improve overall 
code performance. Previous studies[6] have shown that cache utilization, particularly 
the L2 cache hit ratio, can be important for achieving good performance on modern 
workstations. Cache performance was analyzed for each PARCS module using the 
hardware counter library available on the SGI platform. As was shown in Table 4, the 
most time consuming part of the code is the T/H module, specifically the subroutine 
TRTH which solves the convection and heat transfer equations for the reactor coolant 
channel. The second most time consuming part of the code was the CMFD module, 
specifically the subroutine BICG which implements the Krylov method to solve the 
linear system. Therefore, BICG and TRTH were used to analyze cache performance 
for the CMFD and T/H modules, respectively. The cache performance of the other two 
modules, NODAL and XSEC, was analyzed for all the subroutines in the module. For 
the analysis shown here a consistent level 2 optimization was used for both serial and 
threaded versions of the code. 

The data cache misses for each module are summarized in Table 5. The results 
were normalized to the serial misses as shown in Table 6. As shown in Table 6, the 
normalized L2 cache misses for 2 threads are 1.71 for XSEC, 1.66 for CMED, and 
1.60 for NODAL. This correlates reasonably well the speedups shown in Table 4 for 
XSEC, CMED and NODAL of 2.01, 1.63 and 1.55, respectively. This suggests that for 
these three modules the overall performance is determined primarily by the L2 cache 
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utilization. This is reasonable since, as shown in Table 7, an unsatisfied L2 cache miss 
can result in an order of magnitude penalty in machine cycles. The behavior of the 
T/H module does not appear to follow this trend since the normalized L2 data cache 
misses for 2 threads is very low (0.99) while the parallel performance is excellent 
(2.17). For the T/H module it appears that the LI data cache plays a more important 
role. As shown in Table 5, the number of LI data cache misses is about 20 times large 
than L2 cache data cache misses. As shown in Table 6, the LI data cache misses for 
the threaded code are reduced by a factor of 4, which is very large compared with the 
other modules. Examination of the assembly code revealed some fundamental differ- 
ences in the way the compiler treated the serial and threaded versions of the T/H sub- 
routine. For example, the threaded version of the code had considerably more pre- 
fetch commands. 



Table 5. Data Cache Misses 



Module 


Cache 


Serial 


OpenMP 








1* 


2 


4 


8 


CMFD 


LI 


477,691 


479,474 


258,027 


156,461 


105,733 


(BICG) 


L2 


28,242 


29,650 


17,007 


11,751 


9,309 


Nodal 


LI 


857,744 


853,866 


444,849 


249,507 


160,699 




L2 


54,163 


55,534 


33,846 


19,016 


12,848 


T/H 


LI 


165,133 


60,587 


39,419 


25,850 


19,816 


(TRTH) 


L2 


9,551 


9,512 


9,673 


6,451 


4,620 


XSEC 


LI 


62,324 


57,462 


29,845 


17,715 


11,344 




L2 


9,456 


9,518 


5,517 


3,737 


2,578 



* : Number of threads 



Table 6. Data Cache Misses (Normalized) 



Module 


Cache 


Serial 


OpenMP 








1* 


2 


4 


8 


CMFD 


LI 


1.00 


1.00 


1.85 


3.05 


4.52 


(BICG) 


L2 


1.00 


0.95 


1.66 


2.40 


3.03 


Nodal 


LI 


1.00 


1.00 


1.93 


3.44 


5.34 




L2 


1.00 


0.98 


1.60 


2.85 


4.22 


T/H 


LI 


1.00 


2.73 


4.19 


6.39 


8.33 


(TRTH) 


L2 


1.00 


1.00 


0.99 


1.48 


2.07 


XSEC 


LI 


1.00 


1.08 


2.09 


3.52 


5.49 




L2 


1.00 


0.99 


1.71 


2.53 


3.67 



* : Number of threads 






The Application of POSIX Threads and OpenMP 



99 



Table 7. Typical Memory Access Cycles 



Memory Access Type 


Cycles 


LI cache hit 


2 


LI cache miss satisfied by L2 cache hit 


8 


L2 cache miss satisfied from main memory 


75 



In order to relate code performance to machine specific characteristics, the speedup 
was estimated using the data access times: 



S 



■p serial 
total 
rp 2th 
total 



( 1 ) 



where 

= Total data access time for serial execution 
'^mai ~ Total data access time for 2 threads execution. 



And the total data access time is approximated by L2 cache access time and main 
memory access time: 

^total '^Ll T Tmem ^L2 ^ L2 T ^ Mem ^ Mem 

where 

Tl 2 = Total L2 cache access time 
Tmem = Total memory access time 

n ^2 = Number of LI data cache misses satisfied by L2 cache hit 

^Mem ~ Number of L2 data cache misses satisfied from main memory 
= L2 cache access time for 1 word 
^Mem ~ Main memory access time for 1 word. 

Based on the above simple model and the data shown in Tables 5 and 7, the 
speedup for each module were estimated in Table 8. As indicated, the results are in 
reasonable agreement with the actual measured results shown in Table 4. 



Table 8. Estimated 2-Thread Speedup Based on Data Cache Misses for OpenMP on the SGI 



Module 


Speedup 


Measured 


Predicted* 


CMFD (BICG) 


1.63 


1.78 


Nodal 


1.55 


1.80 


T/H (TRTH) 


2.17 


2.04 


XSEC 


2.01 


1.86 



* : Predicted by Eq. (1) 
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5 Conclusions 

Two parallel versions of the nuclear reactor transient analysis code PARCS were de- 
veloped using Pthreads and OpenMP. The parallel performance was analyzed on SUN 
and SGI multiprocessors. Some unresolved issues remain regarding implementation of 
Pthreads on the SGI and OpenMP on the SUN. However, the overall performance of 
OpenMP on the SGI was comparable to Pthreads on the SUN. The performance varied 
depending on the type of calculations performed in each module. A simple predictive 
model was developed based on cache access times and the results agreed reasonably 
well with measured performance. Considering the effort required for implementation, 
the directive based standard OpenMP appears to be the preferred choice for parallel 
programming on a shared memory address machine. Future work will include the 
development of three-dimensional domain decomposition methods that will help alle- 
viate load imbalance issues and improve the scalability of the code. 
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1 Introduction 

OpenMP is an accepted portable programming model for shared-memory platforms. An 
advantage of OpenMP over MPI is that it offers a simple transition from sequential to 
parallel programs. Another question is, how OpenMP can be used in conjunction with 
higher level approaches to parallel programming. 

In this paper, we report about our experiences regarding the integration of OpenMP 
into Janus — a parallel C-I-+ template library for mesh and grid based scientific 
applications |[5j. The issues emphasized here are not primarily to use OpenMP for the 
parallelization of a sequential program. The challenge is to map Janus’ application- 
oriented and explicit description of communication onto OpenMP primitives in order to 
efficiently utilize the underlying shared memory. 

Janus uses the ideas of generic programming^ to identify abstractions and provides 
building blocks for complex parallel scientific applications. Janus clearly distinguishes 
between spatial structures (grids, meshes, or graphs) and the numerical values that are 
associated with them. The architecture and main template classes of Janus are briefly 
present in m 

In (d we discuss how OpenMP has been integrated into Janus components. In a 
subsequent section we discuss a parallel implementation of Conway’s Game of Life that 
uses the grid and stencil templates from Janus. It turns out that OpenMP can be 
integrated smoothly into the Janus abstractions that were originally designed to hide the 
low level details of an underlying message passing system. Moreover, as we show in m 
no dramatic performance losses occur. 

The test platforms for our two applications are both a for 4-processor Solaris shared 
multiprocessor system where we use the OpenMP compiler of KAI|3] and 1 6 node Linux 
PC Cluster featuring a fast MPI implementation. Finally, we summarize our experience 
and outline our future research. 



2 The Janus Framework 

The C++ template library Janusf^VfH provides basic building blocks for the implementa- 
tion of mesh/grid-based scientific applications. The Janus components rest on a simple 
yet very expressive conceptual framework. Its basic idea is that there occur essentially 
two types of objects in scientific applications. These are, on the one hand, spatial struc- 
tures which are, for example, grids, triangulations and graphs. On the other hand, we 



R. Eigenmann and M.J. Voss (Eds.): WOMPAT 2001, LNCS 2104, pp. lOl JTTn 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 
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have (numerical) data associated with these spatial structures, such as grid functions, 
and (sparse) matrices. Fundamental for the Janus framework is the observation that the 
spatial structures are prior to and more stable than the associated data. 



2.1 Generic Programming 

David Musser defines generic programming as “programming with concepts” [^. A 
Concept is set of type requirement. These requirements can be categorized into 



- interface requirements that concern the syntax of involved types, 

- property requirements that deal with semantic constraints. 

A type that fulfills the requirements of a concept is called a model of that concept. 
Generic programming allows a grouping types that is independent of inheritance hi- 
erarchies. This avoidance of explicitly naming of a base interface is closely related to 
structural suht\pin9 \\ 111 . 

A well-known example of generic programming are the concepts and components 
of the C-H- Standard Template Libraryl]{j| (STL) which provides many of the basic 
algorithms and data structures of computer science. Its basic concepts are Container and 
Iterator. 

Algorithms of STL are written in terms of iterators, that is, independent of containers 
in data-structure neutral way. This decoupling of algorithms and containers reduces 
the code size of the STL drastically since a single template function can operate on 
different classes of containers. The algorithm/data-structure interoperability is a key 
aspect of STL’s genericity. Two other aspects are Function Objects that can be supplied 
to containers and algorithms to adapt them to the need of a particular application, and 
Element Type Parameterization, that is, to plug basically every user-defined type into 
STL containers. 



2.2 Concepts of Janus 

The three basic concepts of the Janus framework are Domain, Relation, and Domain 
Function. 

The role played by iterators in the STL is taken by the position of domain elements. 
They are the key for the domain/relation interoperability in Janus. They also decouple 
domain functions from the internal details of an underlying domain or relation. 

Using positions instead of iterators appears at first sight less flexible. However, the 
priority and relative stability of the spatial structures, mentioned above, justifies this 
design decision. 

Yet, the challenge is to express the relative stability in the context of irregular struc- 
tures. Janus provides so-called two phase data structures with explicitly separated inser- 
tion and retrieval phases to solve this problem. 
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The Domain Concept. A domain is a finite set of elements that is given as a sequence. 
Each domain element has a unique (and fixed) position in its domain (its sequence index). 
The fixed one-to-one correspondence of domain elements and positions is crucial for 
Janus and a direct consequence of the stability of the spatial structures. 

The basic requirements are similar to those of STL containers. A model of Domain 
must have nested types value_type and size_type. It must also provide methods 
sizeO to query its number of elements, value_type operator[] (size_type) to 
return the element, and a method size_type position(value_type) to inquire 
the position of an element. These methods let domains appear like searchable random 
access containers. 

Models of Two-Phase Domain must provide a method insert (value_type) and 
a method freeze that must be called after all elements have been inserted and before 
any element is accessed. Thus, freeze marks the transition from the insertion to the 
retrieval phase. 

The Domain Function Concept. Data that are associated with domains or relations can 
exploit the fixed one-to-one correspondence of domain elements and their positions. The 
concept Domain Function describes data associated with a domain as a one-dimensional 
random access data-structure. It must have the same (fixed) length as the underlying 
domain or relation. 

The Relation Concept. Domains have no visible structure. In order to describe depen- 
dences of domain elements there is the concept Relation. Relations between two domains 
are represented in terms of the positions of the domain elements and are described as 
adjacency matrices. In other words, a relation R between two domains X and Y is 
considered as the set of pairs of integers i?' = | {xi,yj) G i?}. 

The relations in scientific compufafions are often sparse. Therefore, the Relation 
concept provides operations that act on domain functions associated with its domains. 
These operations are generalizations of sparse matrix-vector multiplications and can be 
customized by function objects. 

Its simplest form pull template function is in fact a sparse matrix-vector multipli- 
cation with the adjacency matrix. If 6 is a domain function on the domain Y then the 
result of pulling b with R onto X is a domain function a on A for which holds 



The pull and their transposed push operation are utilized to express (remote) data transfers 
between (distributed) domains and can be customized to perform user-defined operations 
on the gathered data. 

The following tabledgives an overview on the main data transfer methods provided 
by Janus relation classes. In particular the pull_visitor and pull_matrix_visitor tem- 
plates are a very powerful means to specify complex user-defined operations. They have 
been inspired by the use of the Visitor Design Pattern||Tl| in the Boost Graph LibraryP^J 




bj for all i. 



(1) 






(BGL). 
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Table 1. Main data transfer operations required by Relation 



Name 


Semantics for all positions i of X. 


pulKb, a) 


See Equation^ 


pull_fun(b, a, F) 


F 




pull_visitor (b, V) 


V 


Voi(id)efl'} / 


pull_matrix(m, b, a) 


ai i — 


( fli -f ) 


pull_matrix_fun(m, b, a, F) 




rriijbj, ai 


pulljnatrix_visitor (m, b, V) 




^ rriijbj, i\ 

{j\(iJ)eR'} / 



Note that all pull and related must be members of a relation class. The reason that 
we decided against globally defined template functions is that pull methods need a lot 
low level state information in order to be implemented efficiently. This holds in particular 
for the a parallel implementation which is the main focus of Janus. 

This means that we heavily rely on template member function for the implementation 
of Janus. 

2.3 Components of Janus 

The core of Janus are six template classes that are shown in table|21 The first two classes 
grid and domain are models of Domain. The grid template can be used for a compact 
representation of rectangular grids. The domain template is a two-phase domain that 
can be used to represent irregular meshes. 

The class array is a simple wrapper around the std : : vector template. It provide 
overloaded versions of numerical operators and can query the size of an underlying 
domain or relation. It mainly exists for convenience — principle std: : vector or even 
one-dimensional C-arrays fulfill the simple requirements of the concept Domain Func- 
tion. 

The remaining three classes are models of the concept Relation. They hide the details 
of three different sparse matrix formats and also provide methods for the convenient 
generation of the patterns related with these formats. 

Janus does not provide many proper algorithms, i.e., algorithms that are implemented 
as functions and not as class methods. This is because the methods of the Relation classes 
are a very powerful mean to express complex data parallel operations. There are, however, 
parallel versions of some of the STL algorithms, for example, count and accumulate, 

There is also a special class of accumulator algorithms in Janus that solve the problem 
of efficient assignment or values to a domain function on a distributed domain. 
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Table 2. Main template classes of Janus 



grid<N> 


An rectangular N-dimensional grid. 


domain<T> 


Indexed set of T objects. 


array<Dom, V> 


Domain function on an arbitrary Dorn. 


stencil<N,Gen> 


General stencil between two N-dimensional grids. 


n_relat ion<X , Y , Gen> 


Regular sparse relation between two domains. 


relation<X, Y> 


General sparse relation between two domains. 



3 Integration of OpenMP 

The conceptual framework of Janus supports independent configurations of its compo- 
nents for distributed and shared memory architectures. As Figure [I] shows, basically 
all user-visible components are available for distributed and shared memory operation 
modes. 



Shared 

Memory 


General 

Utilities 


Distributed 

Memory 


Local Core F 
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JADE 


liii 
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llllllllllllll 



Janus 

Packages 



External 

Software 



Fig. 1. The Janus Software Architecture 



There are only slight differences regarding the interfaces and semantics. For example, 
mapping information can be specified for the domain template class when configured 
for distributed memory. Another difference concerns the size () of Domain. In case of 
a distributed configuration, it returns only the number of local elements. To inquire the 
number of all elements of a distributed domain the method total_size must be used. 

The distributed subfamily of Janus rests on a (still experimental) port package Jade 
(Janus Distributed Engine). By default. Jade is implemented on top of MPI. We are also 
working on a port of Jade to the MTTL|Q. 

Both the shared and distributed memory abstraction share a considerable amount of 
local core functionalities. This includes basic sparse relation functionalities. This small 
subset is the place where OpenMP directives are mainly used. 

To be more precise, we consider the general structure of the three models stencil, 
n_relation, and relation of the concept Relation. Each model R has an associated 
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Fig. 2. Internal structure of the Janus relation classes. The arrows express use relationships. Only 
the Ffun module is annotated with OpenMP directives. 



sparse (matrix) format F — see Figure |2 For example, the associated sparse format for 
stencil is derived from the diagonal sparse format DIA0, whereas relation use 
the compressed row storage CRS[pJ format. Different to conventional object-oriented 
design, a format F is split into and representation type Frep and function module Ffun. 
The module Ffun contains the basic operations to implement the pull methods of a 
relation (see tabled, and only this module contains OpenMP directives. 

The following code fragment in FigureQ shows the pull method from the function 
module of the DIA format. The structure of this method is very simple. I consists of a 
loop over a flag field that indicates whether the current position refers to an inner grid 
point where the stencil can safely applied. 

1 template<typename Offset, typename T> 

2 void pull (const Dffsetft offset, const size_vector& valid, 

3 const T* s, T* t) 

4 { 

5 const size_t* flag = &valid[0]; 

6 int fsize = valid. sizeO ; // int is for guide++ 

7 int i ; 

8 #ifdef _DPENMP 

9 #pragma omp parallel sharedCflag, fsize, s, t) 

10 { 

11 #pragma omp for private(i) schedule (runtime) 

12 #endif /* _QPENMP */ 

13 ford = 0; i < fsize; i++) 

14 if(flag[i]) dia_row: :pull(i, offset, s, t) ; 

15 #ifdef _DPENMP 

16 > 

17 #endif /* _QPENMP */ 

18 } 



Fig. 3. The Janus dia: :pull method with OpenMP directives. 
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The actual operations are hidden in the small inline function dia_row: ;pull. The 
simple structure of the dia; :pull makes it a perfect candidate to be generated as a 
wrapper around dia ; ; pull_row. Regarding the many data transfer methods of a Janus 
relation class (see table |J1), this would drastically simplify the tedious work of writing 
OpenMP directives. Generating the code was one reason to explicitly separate data 
representation and functionality into two different modules. 

The class Frep and the module Ffun are used to implement 

- complete : : F — an implementation of F for shared memory, 

- distributed : : F — an implementation of F for distributed memory. 

These classes are — as Figure El suggests — used by corresponding implementations 
of the Relation type R. We use conditional compilatiorlJto map the names from the name 
spaces complete or distributed to our top level namespace jns. 

This design has the advantage that OpenMP can be utilized not only for a shared 
memory computing platform but also for distributed shared memory systems. 



4 An Application Case Study 



Our goal is to integrate OpenMP into Janus in order to exploit OpenMP’s ability to 
portably utilize the parallel paltforms with shared memory. Ideally, as mentioned above, 
we can hide OpenMP almost completely within the Janus abstractions. This is possible, 
if the Janus user avoids using loops that express data parallel operations. So the question 
is, whether Janus is expressive enough to support a wide range of data parallel scientific 
applications. 

To preliminarily answer this question, we consider a simple scientific application: 
Conway’s Game of Life. This is a well-known cellular automata simulation that involves 
rectangular grids and simple stencils to express data communication. We show how to 
write parallel implementations of this problems with the application-oriented Janus 
abstractions. 



4.1 Conway’s Game of Life 

In Life, the evolution of a population of cells is considered. The underlying spatial 
structure is a rectangular grid (see FigureEJ. A value of 1 represents a living cell (marked 
by •), a value of 0 represents unoccupied grid points (not marked). The cells change 
their state in a lock step manner depending on the number of neighboring living cells 
and their own state. The neighborhood of a grid point is described by an 8-point stencil 
(see Figure El). We apply the stencil only in the interior of the grid to exclude the issue 
of incomplete stencil environments at the boundary. 

* The C preprocessor flags read _JANUS_SHARED_ and _JANUS_DISTRIBUTED_, respectively. 
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Fig. 4. Two successive generation in Life 



4.2 A Janus Implementation of Life 

The following code of Figure^lshows an implementation of Life that uses the basic Janus 
classes grid and stencil. We comment the Janus constructs used in this program. 
Later we discuss alternatives that avoid the proliferation of OpenMP directives into the 
application program. 

Lines 1 and 20 show the initialization and shutdown of Janus runtime environment. 
Depending on the actual configuration, these functions call their respective counter parts 
in MPI and OpenMP. 

Lines 3 through 6 cover the initialization of the basic application parameters. The type 
nested type grid<N> ; ; value_type (see is a typedef for svector<int ,N> which 

is a Janus utility for representing a container whose size is known at compile time. The 
variables upper and lower denote the upper right and lower left corner of the full grid. 
The variable shift is an offset for the respective entities of the inner grid. 

Lines 7 and 8 declare the full grid object outer and its interior inner. 

Line 9 shows the declaration and initialization of the 8-point stencil that describes the 
neighborhood of a grid point. For this the stencil template of Janus is used. Its first 
template parameter specifies the dimension of the involved grids. The second one is a 
function object (here stencilS) that gives a formal description of the actual stencil 
pattern. For Life this descriptions reads 

1 ifj + l), (i + lj-l), {i + l,j), (i+l,j + l). 

The definition of stencilS is shown in Figure El The stencilS function object is 
evaluated only in the constructor of stencil which transforms it into a very compact 
sparse diagonal representation. 

Lines 10 through 12 show the declaration of two integer- valued domain function on the 
full grid. Here the array template of Janus is used. The state of the actual cell population 
is represented by the state domain function (remember: a value of 0 means no cell and 1 
means living cell). The domain function sum will used later to hold for each grid point 
the number of living neighboring cells. 
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1 int mainCint argc, char** argv) {. 

2 j ns :: initialize (argc , argv); 

3 size_t iterations=500 ; 

4 jns : :grid<2> : : value_type upper(1000, 1000); 

5 jns : :grid<2> : : value_type lower(0,0); 

6 jns : :grid<2> : : value_type shift(l,l); 

7 jns::grid<2> outer(lower, upper); 

8 jns::grid<2> inner (lower+shift, upper-shift); 

9 jns : : stencil<2, stencil8> neighbors (inner, outer); 



10 


jns: :array<jns: : grid<2> , int> state (outer . size 


:()); 


11 


// initialization of state not shown 




12 


jns : : array<jns : : grid<2> , int> sum(outer . size () ) ; 


13 


for(size_t i = 0; i < iterations; i++) { 




14 


neighbors .pull (state , sum); 




15 


for(size_t j = 0; j < outer . size () ; j++) 




16 


evaluate(sum, state); 




17 


J 




18 


size_t c = jns :: count (state , 1); 




19 


cout « c « " of " « outer .total_size() << 


endl ; 


20 


jns: : finalize 0 ; 




21 } 







Fig. 5. A straight-forward Janus implementation of Life 



Lines 13 through 17 represent the main loop of the simulation. In each simulation 
step, at hrst, the number of living neighboring cells is counted for each grid point. For 
this the pull template function of stencil (see equation □ of is applied to the 

state domain function. In the subsequent loop, the number of living neighboring cells 
is evaluated and the state in each grid point is updated. The implementation of the hereby 
used inline function evaluate is given in Figured 

The remaining lines of the Life program in Figure d show how using the count 
algorithm of Janus the hnal number of living cells can be determined (in parallel). Note 
that in line 19 the method total.size must be called to determine the number of all 
elements of a (potentially distributed) domain (see also 

4.3 Discussion and a First Improvement 

With respect to OpenMP, the two lines 15 and 16 are the biggest problem of the im- 
plementation shown in Figure d Putting the OpenMP directives into the application 
program is a far from optimal solution. Note that this is not an issue, when Janus is 
conhgured to solely use MPI since then this loop would be executed as part of a parallel 
processes. 
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struct Stencils { 

typedef jns : : svector<int ,2> argument_type ; 

typedef jns : : svector<argument_type ,8> result_type; 
result_type operator () (argument_type v) const {. 
result_type r(v); // copy v 

size_t k = 0; 

forCint i = -1; i < 2; i++) 

forCint j = -1; j < 2; j++) 
if((i != 0) II (j != 0)) 

r [k++] += argument_type(i, j) ; 

return r; 

> 



Fig. 6. Generator definition of a two-dimensional 8-point stencil 

1 inline void evaluate(int& sum, int& state) { 

2 if (sum == 3) state = 1; 

3 else if ((sum == 2) && state) state = 1; 

4 else state = 0; 

5 sum = 0 ; 

6 } 



Fig. 7. Implementation of the evaluate function. Note that this function also resets the sum value 
to 0. 



A straight-forward solution to this problem can be provide by a generic wrapper for 
simple loop expressions that hides the OpenMP directives. In fact, Janus provide the 
template function f or.visitor that does exactly this. 

In order to understand how this template function is used we look at the following 
code fragment in FigurelHl 

4.4 A Second Improvement: Combining pull and Visitors 

In the case of Life, we can add a further improvement by applying the visitor function 
object while executing pull. For this, Janus relations provide the template member 
function pull_visitor (see table nj. The big advantage is that only one loop must be 
executed during each iteration step. 

The code fragment in FigurefTTlreDlaces the lines 12 through 17 of example in figure 

5 Performance Measurements 

We measured the time for the code fragment shown in Figure[TT3because it is the fastest 
of the three Janus implementations (see also the Figures 0 and 0. We also wrote a 
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evaluate_visitor eval(&sum[0] , &state[0]); 



for(size_t i = 0; i < final; i++) { 
neighbors .pull (state , sum); 
jns :: for_visitor (outer , eval) ; 



> 



Fig. 8. Using Janus f or_visit or in Life. The five code lines replace the lines 13 through 17 of the 
original implementation in Figure[51 Instead of explicitly iterating over the outer grid object. The 
function object eval is applied by jns: :for_visitor at each position of outer. The type of 
eval is evaluate_visitor whose implementation is shown in Figure^ This class has pointers 
to the data of the domain functions state and sum that are initialize in line 1 of Figure|^ 

1 struct evaluate_visitor { 

2 int* sum; 

3 int* state; 

4 evaluate_visitor (int* sm, int* st) : sum(sm) , state(st) {}■ 

5 void operatorO (size_t i) { 

6 if(sum[i] == 3) 

7 state [i] = 1; 

8 else if ((sum[i] == 2) && state [i]) state [i] = 1; 

9 else stated] = 0; 

10 sum[i] = 0; 



Fig. 9. Implementation of evaluate_visitor. This function object has only one entry point, 
namely, operatorO that expects a single size_t argument. This argument is the domain posi- 
tion to which the function object is applied. Domain functions that shall be visited (for reading or 
manipulation) must pass the starting address of their data to the visitor. Remember, it is a funda- 
mental Janus requirement i H2.2l) that the positions of a domain are in one-to-one correspondence 
with the indices of an associated domain functions. 

Straight-forward implementation of Life to compare this with our Janus implementation. 
A 1000 X 1000 grid was used and 500 iteration steps performed. 

Since Janus can be configured to use OpenMP or MPI, we also provide performance 
numbers for our 16 node Linux cluster. 

5.1 Janus and OpenMP on a Shared Memory System 

Our primary test system for this paper was a small shared-memory system with four 
CPU (Sun E3000, UltraSPARC-! 168MHz). 

We used KAPs Guide OpenMP compiler (version 3.9) with the options +K3 -x04 
-fast -xtarget=ultra. In order to compare overheads due to the use of OpenMP, 
we also used Sun’s KCC C-H- Compiler (version 3.4) with the same options as Guide. 
We also measured times for using the GNU C-H- compiler (version 2.95.2, using the 
options -03 -funroll-loops -f expensive-optimizations. 



11 J 

12 }; 
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1 jns : : array<jns : : grid<2> , int> new_state(outer . sizeO ) ; 

2 for(size_t i = 0; i < final; i++) { 

3 evaluate_pull_visitor eval(&state [0] , &new_state [0] ) ; 

4 neighbors. pull_visitor(state, eval) ; 

5 state . swap (new_state) ; 

6 > 



Fig. 10. Using pull_visitor in Life. First we note that instead of the domain function sum the 
domain function new_state is used. In line 3, the start addresses of state and new_state are 
bound to the visitor function object eval. The type of eval is evaluate_pull_visitor whose 
implementation is shown in FigurefTTl After calling pull_visitor in line 4 the state of the next 
generation of cells is stored in the domain function new_state. We use the swap member function 
(line 5) to efficiently assign new_state to state. 

1 struct evaluate_pull_visitor { 

2 const int* statel; 

3 int* state2; 

4 evaluate_pull_visitor (const int* si, int* s2) : 

5 statel (si) , state2(s2) {}■ 

6 void operatorO (int sum, size_t i) { 

7 if (sum == 3) state2[i] = 1; 

8 else if ((sum == 2) && statel [i]) state2[i] = 1; 

9 else state2[i] = 0; 

10 y 

11 }; 

Fig. 11. Implementation of evaluate_pull_visitor. This function object has the only entry 
point operatorO . It expects two arguments: the first one is the value that represents the result 
of pull at i-th position of the first domain. The second one is exactly this position. As in the case 
of the visitor in Figure|3 domain functions that shall be visited must pass the starting address of 
their data to the visitor. 



The pure C implementation is roughly 1.5 faster than the Janus implementation. The 
C implementation also scales better and achieves a speedup of 4.52 for four threads 
(compared to a speedup of 3.62 in case of Janus). These are probably cache effects that 
are clearer visible in the C implementation. Therefore, the results for four threads show 
that the C implementation is almost times two times faster than Janus. 



5.2 Janus and MPI on a Distributed Memory System 

Our computing platform for the distributed case is a Linux cluster consisting of 16 
Athlons running at 550 MHz. The nodes are connected through a Myrinet network that 
is utilized through the BIP0 implementation of MPI. 

Table^shows that the Janus program delivers an acceptable absolute speedup of 12 
when using 16 nodes. 
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Table 3. Results for OpenMP on a Sun system 





Janus Implementation 


C Implementation 


Compiler (Threads) 


Time in s 


Speedup 


Time in s 


Speedup 


Guide (1) 


66.87 


1.00 


42.59 


1.00 


Guide (2) 


34.68 


1.93 


18.74 


2.27 


Guide (3) 


23.31 


2.87 


12.21 


3.49 


Guide (4) 


18.48 


3.62 


9.43 


4.52 


KCC 


55.66 


- 


34.07 


- 


g++ 


62.37 


- 


35.23 


- 



Table 4. Result for MPI on a Linux cluster 



Nodes 


sequential 


1 


2 


4 


8 


16 


Time in s 


48.03 


63.05 


31.64 


15.92 


7.94 


3.97 


Speedup 


1.00 


0.76 


1.52 


3.02 


6.05 


12.10 



6 Future Works 

So far, we found our experience with integrating OpenMP into Janus quite encouraging. 
For this (simple) application we could achieve reasonable speedup both when using 
OpenMP and MPI. More important, we could completely hide OpenMP directives within 
Janus. A next step for us is to integrate OpenMP into the remaining Janus components 
(see table 0. An Janus implementation of the Bellman-Ford Shortest Graph algorithm 
showed that the Janus constructs are also expressive enough for more complex and in 
particular irregular applications. 

The next but one step is then to exploit the easy conhgurability of J anus and combining 
OpenMP and MPI within our framework. This would make our library very attractive 
for SMP clusters. So far however, we do not have access to SMP clusters with OpenMP 
support. This highlights a general drawback of the OpenMP approach. OpenMP must 
be integrated into a C-H- compiler (in contrast to MPI). This hinders its availability, in 
particular on high performance machines. 

Janus can be downloaded from www . f irst . gmd . de/ j auius. 
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Abstract. In contrast to the common belief that OpenMP requires 
data-parallel extensions to scale well on architectures with non-uniform 
memory access latency, recent work has shown that it is possible to de- 
velop OpenMP programs with good levels of memory access locality, 
without any extension of the OpenMP API. The vehicle for localizing 
memory accesses transparently to the programming model, is a runtime 
memory manager, which uses memory access tracing and dynamic page 
migration to implement automatic data distribution. This paper evalu- 
ates the effectiveness of using this runtime data distribution method in 
non embarrassingly parallel codes, such as the SPEC benchmarks. We 
investigate the extent up to which sophisticated management of phys- 
ical memory in the runtime system can speedup programs for which 
the programmer has no knowledge of the memory access pattern. Our 
runtime memory management algorithms improve the speedup of five 
SPEC benchmarks by 20-25% on average. The speedups are close to the 
theoretical maximum speedups for the problem sizes used and they are 
obtained with a minimal programming effort of about a couple of hours 
per benchmark. 



1 Introduction 

There is an ongoing debate between developers and users of OpenMP, about 
how should OpenMP be extended to scale better on architectures with non- 
uniform memory access latency, such as ccNUMA multiprocessors and clusters 
of SMPs p2Kll 1[ . Most of the related proposals converge to the conclusion that 
OpenMP should be extended with data distribution directives similar to the 
ones used in data-parallel programming languages like HPF. This argument is 
counterweighted by a recent research outcome j!SII ()) which suggests that it is 
possible to replace manual data distribution with intelligent runtime memory 
management algorithms, which infer the memory access pattern of the program 
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and the most appropriate placement of data from traces of memory accesses 
collected in hardware counters. 

We have developed a runtime memory manager which localizes transparently 
the memory accesses of OpenMP programs on tightly-coupled ccNUMA multi- 
processors m- The memory manager utilizes snapshots of memory access traces 
and dynamic page migration, to perform data distribution in a manner that min- 
imizes the latency of remote memory accesses. The distinguishing feature of our 
runtime system is that it works transparently to the programmer and requires no 
modifications to the OpenMP API. It requires merely a simple instrumentation 
pass by the OpenMP translator to activate the memory manager. The runtime 
system detects automatically the data segment of the program and applies page 
migration algorithms by scanning the data segment periodically. The algorithms 
can exploit feedback from compiler analysis for data distribution, however they 
are primarily designed to operate in a fully automated process. The program- 
ming overhead for using the algorithms amounts to recompiling and linking the 
program with the runtime system. 

The design of our runtime data distribution method is consistent with the cur- 
rent design principles of OpenMP, which dictate that the implementation must 
hide the details of the parallel architecture and the underlying hardware/software 
interface from the programmer. We consider our work as part of a broader effort 
towards building parallel programming models that meet the requirements of 
code and performance portability simultaneously. The purpose is to render the 
average programmer able to rapidly develop efficient portable parallel code and 
minimize the effort / speedup ratio, using a combination of OpenMP and runtime 
techniques for performance tuning. 



1.1 Motivation 



We have successfully applied our runtime data distribution method in paral- 
lel programs where manual data distribution would otherwise be necessary to 
reduce the number of remote memory accesses piSIhll O] . So far, we have eval- 
uated the performance of our runtime system (named UPMlib after user-level 
page migration) using the OpenMP implementations of the NAS benchmarks |S| . 
Using these codes as a starting point provided us with a handful of advantages. 
The NAS benchmarks are highly parallel codes with coarse-grain parallelism 
and high sustained efficiency on most scalable parallel architectures; they are 
already well-tuned by their providers, the OpenMP implementations in particu- 
lar are tuned specifically for the 0rigin2000, encompassing sophisticated cache- 
conscious programming and NUMA-aware data placement; moreover, the best 
manual data distribution algorithms, as well as the best automatic data place- 
ment algorithms for the NAS benchmarks were a-priori known to us, therefore 
we had a well-defined basis for comparisons. 

In this work, we report our experience from using our runtime data distribu- 
tion method with floating-point benchmarks from the SPEC CPU2000 and the 
SPEChpc benchmark suites Compared to the NAS benchmarks, the SPEC 
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Table 1. Coverage of parallel code in the SPEC benchmarks with which we experi- 
mented. 



Benchmark 


Coverage 


Maximum speedup (Amdahl’s law) 


swim 


77% 


4.34 


mgrid 


80% 


5.00 


applu 


51% 


2.04 


equake 


93% 


14.28 


climate 


80% 


5.00 



codes present us with a different picture. The native SPEC CPU2000 bench- 
marks are not parallelized. A very short parallelization effort of about a couple 
of hours per benchmark indicated that the codes expose a degree of parallelism 
which is well below that of the NAS benchmarks. The coverage of parallel code 
(ratio of the execution time of parallel code over the execution time of the entire 
program on a single processor) in the benchmarks we parallelized averages 76% 
(see Table Q . Amdahl’s law suggests that the theoretical maximum speedups 
that the benchmarks can attain is very limited (6.13 on average). Practically, 
the parallelized loops are too fine-grain to provide sizable speedups. The actual 
maximum speedup of the hand-parallelized codes on a 64-processor 0rigin2000 
ranges between 1.3 and 10. Of more interest to our work, is the fact that we are 
not aware of what is the best data placement algorithm for these benchmarks on 
a NUMA system, in order for us to have a point of reference for the performance 
of UPMlib. 

The objective of this study is to investigate the extent up to which memory 
access localization can improve the scalability of codes with the characteristics 
of the SPEC benchmarks, on medium-scale NUMA systems. An important prop- 
erty common to both the SPEC and the NAS benchmarks, is that the codes are 
iterative, i.e. they repeat the same piece of computation for a number of itera- 
tions that correspond to ticks of a virtual timer. This is the exact class of codes 
for which our runtime system is more effective in localizing memory accesses P| . 
Given that our runtime system is already tested against manual data distribu- 
tion using iterative codes with known best data placement algorithms 0, a study 
with the SPEC benchmarks can provide us with an indication of the importance 
of memory accesses locality, without the burden of investigating, implementing 
and testing explicit data distribution schemes for the benchmarks. 

Our findings are summarized as follows: Proper relocation of pages for local- 
izing memory accesses has a significant impact on the scalability of the SPEC 
benchmarks. Linking the codes with UPMlib yields a speedup of up to 45% 
over the minimum execution time of the unmodified OpenMP code and up to 
50% over the execution time with the maximum number of processors on the 
system with which we experimented (a 64-processor SGI 0rigin2000). It can be 
argued that the same or even higher speedup could be obtained with manual 
data distribution or careful restructuring of the code to localize the memory 
access pattern of each processor. This argument is counterweighted by the fact 
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that these transformations require substantially more programming effort and 
non-portable extensions to OpenMP. 

A second outcome of our experiments is that a good fraction of the speedup 
obtained from our runtime page migration algorithms can be obtained with a 
simpler automatic page placement algorithm, which invalidates the pages that 
store the data accessed within parallel loops and maps them locally to each 
processor on a first-touch basit0, the first time a parallel loop is executed. This 
mechanism resembles the generic first-touch page placement algorithm [0|, but 
instead of being used from the beginning of execution and for the whole address 
space of the program, it is used right before the first iteration of the outer time- 
stepping loop that encapsulates the parallel computation and solely for regions of 
the address space which are likely to incur frequent remote memory accesses. The 
intuition behind the algorithm, is to place pages strictly according to the memory 
access pattern of the parallel code. In this way, automatic page placement is not 
biased by the effects of initialization, which may include sequential access of 
critical arrays (forcing the placement of relevant pages on a single node), or a 
parallel initialization phase the memory access pattern of which does not match 
the memory access pattern of the actual parallel computation. 

The rest of this paper is organized as follows. Section El provides some back- 
ground on the basic concepts of our runtime data distribution method. Sectional 
outlines our methodology. Section 0] presents the results from our experiments 
and Section 13 concludes the paper. 

2 Background 



UPMlib uses dynamic page migration as a tool for implicit data distribution. 
The runtime system infers the memory access pattern of a program by taking 
snapshots of page reference counters 0 The key idea of the memory management 
algorithms of UPMlib is to retrieve snapshots of reference traces that reflect 
accurately the memory access pattern of the whole program, or specific parts of 
the program for which data distribution is required to localize memory accesses. 
Using these snapshots, the runtime system distributes data transparently to 
the programmer, using a cost/benefit criterion based on the frequency and the 
latency of remote memory accesses to each page. 

In iterative parallel codes (which constitute the majority of parallel codes in 
use today), implementing global data distribution with UPMlib is as simple as 
retrieving a snapshot of the memory access trace at the end of the first iteration 
of the parallel computation and applying the page migration criterion on this 
snapshot. It has been shown that this simple runtime technique performs at least 

^ The term Erst-toucb refers to a page placement algorithm in which the processor 
that touches a page first maps the page to a local memory module. 

^ Currently, the system is implemented on the 0rigiu2000, which collects per-node 
reference information for each page in hardware counters. The counters are copied 
back to memory by the operating system upon overflow. 
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Table 2. Parallelized loops in the SPEC CPU2000 benchmarks (subroutines enclosing 
the loops and loop labels given in parentheses). 



1 71. swim 


INITAL (50, 60, 70,75,86) 
CALC1(100, 110,115) 
CALC2(200, 210,215) 
CALC3Z(400) 
CALC3(300,320,325) 


1 72. mgrid 


PSINV(600) 

RESID(600) 

RPJ3(100) 

INTERP (400,800) 

ZERO3(100) 

ZRAN3(400) 


1 73. applu 


jacld(outer k-loop) 
jacu(outer k-loop) 
rhs(outer k- and j-loops) 


183. equake 


smvp(i-Ioop) 

all i-loops inside main time-stepping loop 



as well and usually better than manual data distribution in coarse-grain, em- 
barrassingly parallel codes like the NAS benchmarks mu. while it can be easily 
extended to implement data redistribution within iterations. Under certain cir- 
cumstances, UPMlib outperforms data distribution, because the runtime library 
captures more accurate page reference information. 

The main disadvantage of UPMlib is that it is prone to the overhead of 
the page migration algorithms, which require expensive operations such as data 
copying, TLB coherence maintenance and several system calls for accessing hard- 
ware counters. This disadvantage shows up in fine-grain codes, more specifically, 
when the execution time per iteration is in the order of a few tens of milliseconds 
or less. Note that this problem occurs in data distribution tools as well, it is not 
an inherent disadvantage of UPMlib. A second disadvantage in that the page 
migration algorithms of UPMlib require some form of repeatability in the mem- 
ory access pattern. The algorithms can not handle adaptive programs, that is, 
codes that perform an unpredictable amount of computation in each time step. 
Both issues, i.e. the runtime overhead of UPMlib and adaptive programs are a 
subject of ongoing work. We attempt to address the former by parallelizing the 
page migration algorithms, via inlining the algorithms in the OpenMP threads. 
For the latter, we investigate statistical approaches for data distribution. 

3 Methodology 

We manually parallelized six benchmarks from the SPEC CPU2000 floating 
point suite, namely swim, mgrid, equake, applu, apsi and lucas. The results for 
apsi, lucas and applu are qualitatively similar. More specifically, the speedup of 
the benchmarks flattens beyond 2 processors, due to the very limited coverage 
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of parallel code (around 50%). We omit the results from the executions of these 
benchmarks, since they do not appear to be of interest to our study. 

We parallelized the most time-consuming loops in each benchmark. We did 
not attempt to exploit coarse-grain task-level parallelism, although some bench- 
marks might benefit from it P|. Most of the loops that we parallelized are iden- 
tified as parallel by the SGI MIPSpro compiler, however in some cases, most 
notably in applu and equake, we had to manually apply privatization, rewrit- 
ing of loop indices and induction variable elimination. The parallelized loops for 
each benchmark are listed in Table El Although the effort spent in parallelizing 
the benchmarks was not major, we can state with some confidence that extract- 
ing more parallelism out of these benchmarks would probably require highly 
sophisticated interprocedural analysis and other parallelization techniques not 
found in most commercial compilers. It might also require an extended execution 
model that exploits multiple levels of task and data parallelism Q. Even if these 
techniques are applied, there is no clear evidence of their effectiveness. We have 
also experimented with the SPEChpc mesoscale climate modeling code. Parts of 
this code are already parallelized with OpenMP directives. We ran this code as 
distributed in the SPEChpc96 suite. 

We executed three versions of the benchmarks. The first version is the un- 
modified parallelized OpenMP code. This code is executed using STATIC loop 
scheduling for all parallel loops and the first-touch page placement algorithm 
of IRIX. The IRIX kernel-level page migration engine was disabled during the 
experiments. IRIX includes a competitive dynamic page migration engine, which 
is activated by setting the JDSMJTIGRATION environment variable. We have 
run numerous tests with the IRIX page migration engine and found no case were 
the engine could provide meaningful improvements. All benchmarks except swim 
were slowed down by the IRIX page migration engine (equake slowed down by 
as much as 17%). Swim’s improvement was less than 2% on 32 processors. Note 
that all benchmarks except swim have no parallel initialization phase. Parallel 
initialization is helpful in certain cases, because if an automatic data placement 
algorithm is applied during initialization, data may be distributed quite effec- 
tively before the beginning of the main parallel computation. 

The second version is the parallelized OpenMP code linked with UPMlib. 
UPMlib applies a competitive page migration criterion to pages accessed during 
the execution of parallel code at the end of the first and, if needed, subsequent 
iterations of the outer time-step loop m- During our experiments with the 
SPEC benchmarks, all migrated pages were identified after the execution of the 
first iteration and the runtime system’s page migration engine was deactivated 
thereafter. The overhead of executing the page migration algorithm was therefore 
minimal. Note that although UPMlib minimizes its runtime overhead, the cost 
of page migration is still significant enough to account for, as shown in Section® 

The third version of the benchmarks is produced from the OpenMP code 
with the following modification. The pages accessed during the execution of par- 
allel code are invalidated with the mprotect() system call, right before the first 
iteration of the outer time-stepping loop. We install a handler for the SIGSEGV 
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signal, which records the address of the faulting page and maps the page to a 
local frame (i.e. residing on the same node with the processor that incurs the 
fault) using mmap(). This modification implements a restricted form of first- 
touch page placement. In particular, pages are mapped locally to the processor 
that touches them first during the first executed instance of a parallel loop. This 
ensures that pages are placed on a first-touch basis, according to the page ref- 
erence pattern of the parallelized part of the code, rather than the reference 
pattern of the program as a whole. Basically, the mechanism discards any inop- 
portune placement of pages that might be performed from the operating system 
during the initialization phase. We applied one optimization in the algorithm, 
for situations where the memory access patterns of different parallel loops do not 
match. In these cases, the algorithm applies first-touch page placement during 
the most time consuming loops with the same access pattern. We applied the 
algorithms in all benchmarks except swim. Swim already includes a parallelized 
initialization phase, which performs implicitly the placement of pages that our 
modification would otherwise do during the first iteration of the time-stepping 
loop. 

Our hardware platform is a 64-processor (32-node) SGI 0rigin2000, with 
MIPS RlOk processors running at 250 MHz. Each processor has 32 Kbytes of 
split LI cache and 4 Mbytes of unified L2 cache. The system has 12 Gbytes of 
DRAM memory. The experiments were conducted on an idle system. The re- 
ported execution times are medians of three runs. We used the train problem 
sizes. Although this was mandated by system administration restrictions in the 
time available for running experiments on an idle system, the selected problem 
sizes are coarse enough to indicate parallel performance trend^. The maximum 
number of processors used was 62, because we detected contention between the 
IRIX kernel and the application threads when the benchmarks used all 64 pro- 
cessors. The codes were compiled the with -02 optimization level. 



4 Results 

Figure n illustrates the execution times of four benchmarks versus the number 
of processors. Note that the y-axes are drawn logarithmic and the minimum 
and maximum values of the axes are tuned according to each benchmark’s par- 
allel execution time, for the sake of readability. The labels correspond to the 
unmodified OpenMP code (OpenMP), the OpenMP code linked with UPMlib 
(OpenMP+upmlib) and the OpenMP code modified to use our optimized first- 
touch page placement algorithm (OpenMP +opt Jt) . 

We observe a clear trend in the results. UPMlib reduces the minimum execu- 
tion time of the benchmarks, regardless of the number of processors on which this 
time is obtained, by up to 45% (see Table E|) . On the maximum number of pro- 

® The granularity of the train problem size is roughly equivalent to that of the Class 
A problem size of the NAS benchmarks. The execution time per iteration is in the 
order of several hundreds of milliseconds. 
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processors processors 

equake climate 



Fig. 1. Execution times. 



cessors the improvement is up to 50%. On average, the margin of improvement 
is similar to that observed for the NAS benchmarks [Jj. 

Figures El and El illustrate how UPMlib localizes memory accesses for higher 
performance. The charts show histograms of memory accesses from the execu- 
tions of the benchmarks on 32 processors, divided into local (gray part) and 
remote (black part) memory accesses. The applied page migration algorithms 
have two important effects. First, they convert a significant fraction of remote 
memory accesses (in all cases more than 50%) into local memory accesses. On 
the 0rigin2000, this translates into a net saving of at least 100 ns. and up to 800 
ns. per memory access 0. The reductions in execution time are roughly propor- 
tional to the amount of remote memory accesses converted into local ones by 
UPMlib. The benchmarks with more remote memory accesses benefit more from 
dynamic page migration. Mgrid and climate, which average 1.5 and 6 million re- 
mote memory accesses per node, enjoy speedups of 45% and 26% respectively. 
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Table 3. Reduction of execution time (in percent) with UPMlib and our optimized 
first-touch algorithm. 





UPMlib 


Optimized first touch 




min. exec, time 


exec, time on 62 proc. 


min. exec, time 


exec, time on 62 proc. 


swim 


-7.4 


-8.7 


-2.5 


-3.7 


mgrid 


-44.8 


-50.3 


-44.5 


-18.4 


equake 


-12.7 


-25.7 


-41.2 


-38.4 


climate 


-26.4 


-25.9 


-8.7 


-8.0 


avg. 


-20.3 


-27.7 


-24.2 


-17.1 


stdev 


16.7 


32.4 


29.7 


20.5 



The second and apparently equally important effect of our runtime data 
distribution method is the alleviation of contention. The memory access traces 
reveal that all benchmarks have a highly unbalanced pattern of remote memory 
accesses. This means that a few nodes are accessed remotely significantly more 
frequently compared to the other nodes. The nodes that concentrate frequent 
remote memory accesses are likely to suffer from contention at their memory 
modules and network links. Contention is an important, yet underestimated ef- 
fect on the performance of NUMA systems. On the 0rigin2000, the contention 
factor may account for as much as an additional 50 ns. per contended node 
per remote memory access 0. UPMlib reduces contention by reducing and dis- 
tributing evenly the remote memory accesses. The right charts in Figures |2| and 
0 show that UPMlib achieves an almost perfectly balanced distribution of re- 
mote memory access in mgrid and equake. The remote memory accesses in swim 
are lightly unbalanced, however the low number of remote memory accesses per 
processor makes the contention effect almost imperceptible. 

Intuitively, we expected somewhat more improvements, because unlike the 
OpenMP implementations of the NAS benchmarks , the SPEC codes are not 
tuned for efficient execution on a NUMA system. Practically, this effect does 
not show up in the experiments because the scalability of the SPEC benchmarks 
is primarily limited by the limited coverage of parallel code. A closer look at 
the results reveals that the benchmarks that have the higher speedup are also 
the ones that benefit more from the use of our page migration engine. Based on 
this observation, we speculate that extracting more parallelism out of the bench- 
marks is likely to make the impact of memory access localization more profound 
and the use of our runtime data distribution method vital. On the other hand, 
the magnitude of difference between the OpenMP and the OpenMP+upmlib 
versions indicates that the hardware of the 0rigin2000 is quite effective in re- 
ducing the impact of remote memory accesses, by guaranteeing a relatively low 
remote-to-local memory access latency ratio. We expect this trend to prevail 
in next-generation NUMA systems, which include full-fledged hardware mecha- 
nisms for reducing the impact of remote memory accesses, such as remote access 
caches and COMA protocols. 
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nodes 



nodes 



mgrid, OpenMP 



mgrid, OpenMP+upmlib 



Fig. 2. Memory access histograms of swim and mgrid during their execution on 32 
processors (16 nodes) of the 0rigin2000. 



Interestingly, our optimized first-touch algorithm outperforms the native 
IRIX page placement algorithm. In two benchmarks, swim and climate, the 
improvements are considerably lower compared to those yielded by UPMlib (see 
Table El). Figure El shows that in climate, the optimized first-touch algorithm 
has more remote memory accesses. The pattern of remote memory accesses is 
also highly unbalanced, which is a strong indicator of contention. The results 
show that first-touch is not the best choice of an automatic page placement 
algorithm for climate. We attempted to fix this problem using a round-robin 
page placement algorithm instead of first-touch without success. Round-robin 
performs slightly better than first-touch because it distributes better the remote 
accesses. However, the actual number of remote memory accesses is increased 
with round-robin. The memory access pattern of climate is both unbalanced 



A Study of Implicit Data Distribution Methods for OpenMP 



125 




nodes nodes 



equake, OpenMP equake, OpenMP+upmlib 




nodes nodes 

climate, OpenMP climate, OpenMP+upmlib 

Fig. 3. Memory access histograms of equake and climate during their execution on 32 
processors (16 nodes) of the 0rigin2000. 



and irregular, therefore dynamic page migration is the best option for optimiz- 
ing memory access locality. 

Both the page migration algorithm and the optimized first-touch algorithm 
perform approximately the same in mgrid. In equake, the optimized first-touch 
algorithm outperforms the IRIX page placement algorithm by a significantly 
wider margin compared to our dynamic page migration engine. This result is 
somewhat surprising. Figure 0 shows that the optimized first-touch algorithm 
incurs less remote memory accesses than the page migration algorithm. We were 
not able to find a convincing explanation for this effect, other than that UPMlib 
relocates pages that concentrate frequent remote memory accesses later than 
the optimized first-touch algorithm. The reason which is more likely to explain 
the performance of the page migration engine in equake is the overhead of page 
migrations. 
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climate, OpenMP+upmlib 



climate, OpenMP+opt_ft 



Fig. 4. Comparison of the memory access traces of the page migration engine and the 
optimized first-touch algorithm in climate and equake. 



Figure El shows the relative overhead of the page migration algorithm during 
the executions of the benchmarks on 62 processors. UPMlib uses a thread that 
executes the page migration algorithms in parallel with the program. Although 
the overhead of page migration is overlapped, there is still an interference be- 
tween the UPMlib thread and one or more OpenMP threads. We conservatively 
estimated this interference by measuring the CPU time spent by the UPMlib 
thread and assuming that 50% of this CPU time is spared from a thread of 
the program. The relative overhead of page migration in equake exceeds 20% 
of the execution time of the benchmark. As a comparison, the relative overhead 
of the optimized first-touch algorithm in equake is 4% of the total execution 
time (shown in the right chart of Figure El)- The overhead of page migration is 
noticeable in swim and climate as well. 
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Fig. 5. Overhead of page migration (left) and the optimized first-touch algorithm 
(right), normalized to the execution time of the benchmarks. 



The performance of our optimized first-touch algorithm brings life to an old 
idea suggesting that automatic page placement algorithms for NUMA systems 
can be more effective, when coupled with hints from the runtime system about 
the points of execution at which the algorithms should be activated (referred to 
as phase changes in the related literature [B|). The actual contribution to our 
OpenMP memory management framework is a low-cost, transparent mechanism 
for localizing memory accesses, which might prove of practical use in cases where 
dynamic page migration is vulnerable, most notably in fine-grain parallel codes 
0 . 

5 Conclusions 

This paper analyzed the performance of runtime memory management algo- 
rithms that localize the memory accesses of OpenMP programs on a NUMA 
system, using programs from the SPEC benchmark suites. The SPEC bench- 
marks are challenging for automatic memory management algorithms, because 
they are not embarrassingly parallel, neither are they tuned for efficient exe- 
cution on NUMA systems. Linking the codes with our runtime system yielded 
a solid performance improvement of 20-25% on average. Similar or somewhat 
lower improvements were obtained from a simpler algorithm that places pages 
on a first-touch basis during the first invocation of each parallel loop. This mech- 
anism may be valuable when the overhead of page migration is non-negligible. 
Overall, the results are consistent with a recently established trend that favors 
the use of intelligent runtime methods to scale OpenMP on clustered NUMA 
architectures, without modifying the appealing OpenMP API 0. We plan to 
take several steps in this direction, hoping to secure performance portability 
with unmodified implementations of the OpenMP standard. 



128 D.S. Nikolopoulos and E. Ayguade 



Acknowledgments. Jesus Labarta, Theodore Papatheodorou and Constan- 
tine Polychronopoulos have contributed valuable insight in earlier stages of this 
research. This work was supported by NSF Grant No. EIA-9975019 and the 
Spanish Ministry of Education Grant No. TIC98-511. The experiments were 
conducted with resources provided by the European Center for Parallelism of 
Barcelona (CEPBA). 

References 

1. E. Ayguade, X. Martorell, J. Labarta, M. Gonzalez, and N. Navarro. Exploiting 
Multiple Levels of Parallelism in OpenMP: A Case Study. In Proc. of the 1999 
International Conferenee on Parallel Processing (ICPP’99), pages 172-180, Aizu, 
Japan, August 1999. 

2. S. Benkner and T. Brandes. Exploiting Data Locality on Scalable Shared Memory 
Machines with Data Parallel Programs. In Proc. of the 6th International EuroPar 
Conference (EuroPar’2000), pages 647-657, Munich, Germany, August 2000. 

3. J. Bircsak, P. Graig, R. Crowell, Z. Cvetanovic, J. Harris, C. Nelson, and C. Offner. 
Extending OpenMP for NUMA Machines. In Proc. of the lEEE/ACM Supercom- 
puting’2000: High Performance Networking and Computing Conference (SC’2000), 
Dallas, Texas, November 2000. 

4. D. Lenoski C. Hristea and J. Keen. Measuring Memory Hierarchy Performance 
on Cache-Coherent Multiprocessors Using Microbenchmarks. In Proc. of the 
ACM/IEEE Supercomputing ’97: High Performance Networking and Computing 
Conference (SC’97), San Jose, California, November 1997. 

5. H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of the NAS Parallel 
Benchmarks and its Performance. Technical Report NAS-99-011, NASA Ames 
Research Center, October 1999. 

6. M. Marchetti, L. Kontothanassis, R. Bianchini, and M. Scott. Using Simple Page 
Placement Schemes to Reduce the Cost of Cache Fills in Coherent Shared-Memory 
Systems. In Proc. of the 9th IEEE International Parallel Processing Symposium 
(IPPS’95), pages 380-385, Santa Barbara, California, April 1995. 

7. D. Nikolopoulos, E. Ayguade, J. Labarta, T. Papatheodorou, and C. Poly- 
chronopoulos. The Trade-Off between Implicit and Explicit Data Distribution in 
Shared- Memory Programming Paradigms. In Proc. of the 15th ACM International 
Conference on Supercomputing, Sorrento, Italy, June 2001. 

8. D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and 
E. Ayguade. A Transparent Runtime Data Distribution Engine for OpenMP. 
Scientific Programming, May 2001. 

9. D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and E. 
Ayguade. Is Data Distribution Necessary in OpenMP ? In Proc. of the lEEE/ACM 
Supercomputing’2000: High Performance Networking and Computing Conference 
(SC’2000), Dallas, Texas, November 2000. 

10. D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and E. 
Ayguade. UPMlib: A Runtime System for Tuning the Memory Performance of 
OpenMP Programs on Scalable Shared-Memory Multiprocessors. In Proc. of the 
5th ACM Workshop on Languages, Compilers and Runtime Systems for Scalable 
Computers (LCR’2000) , LNCS Vol. 1915, pages 85-99, Rochester, New York, May 
2000 . 




A Study of Implicit Data Distribution Methods for OpeuMP 



129 



11. V. Schuster and D. Miles. Distributed OpenMP, Extensions to OpenMP for SMP 
Clusters. In Proc. of the Workshop on OpenMP Applications and Tools (WOM- 
PAT’2000), San Diego, California, July 2000. 

12. Standard Performance Evaluation Corporation (SPEC). SPEC CPU2000 and 
SPEC hpc96 documentation, http://www.spec.org, accessed 2001. 




OmniRPC: A Grid RPC Facility for Cluster and 
Global Computing in OpenMP 
(Extended Abstract) 



Mitsuhisa Sato^, Motonari Hirano^, Yoshio Tanaka^, and Satoshi Sekiguchi^ 



^ Real World Computing Partnership, Tsukuba, Japan 
^ Software Research Associates, Inc 
^ Electrotechnical Laboratory 



Abstract. Omni remote procedure call facility, OmniRPC, is a thread- 
safe grid RPC facility for cluster and global computing environments. 
The remote libraries are implemented as executable programs in each 
remote computer, and OmniRPC automatically allocates remote library 
calls dynamically on appropriate remote computers to facilitate location 
transparency. We propose to use OpenMP as an easy-to-use and simple 
programming environment for the multi-threaded client of OmniRPC. 
We use the POSIX thread implementation of the Omni OpenMP com- 
piler which allows multi-threaded execution of OpenMP programs by 
POSIX threads even in a single processor. Multiple outstanding requests 
of OmniRPC calls in OpenMP work-sharing construct are dispatched to 
different remote computers to exploit network-wide parallelism. 



1 Introduction 

In this paper, we propose a parallel programming model for cluster and global 
computing using OpenMP and a thread-safe remote procedure call facility, Om- 
niRPC. 

In recent years, two important computing platforms, a cluster of worksta- 
tion/PC and a computational grid, have been gathering many interests in high 
performance network computing. Recent progress in microprocessors and inter- 
connection networks motivates high performance computing using clusters out of 
commodity hardware. Advances in wide-area networking technology and infras- 
tructure make it possible to construct large scale high-performance distributed 
computing environments, or computational grids that provide dependable, con- 
sistent and pervasive access to enormous computational resources. 

Omni remote procedure call facility, OmniRPC, is a thread-safe implemen- 
tation of Ninf RPC which is a grid RPC facility in a wide-area network. The 
remote libraries for OmniRPC are implemented as executable programs, and 
are registered in each remote computer. The OmniRPC programming interface 
is designed to be easy-to-use and familiar-looking for programmers of existing 
languages such as FORTRAN, C and C-I--I-, and is tailored for scientific com- 
putation. The user can call the remote libraries without any knowledge of the 
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Fig. 1. OpeuMP multi-threaded client and OmniRPCs 



network programming, and easily convert his existing applications that already 
use popular numerical libraries such as LAPACK. A client can execute the time- 
consuming part of his program in multiple and heterogeneous remote computers, 
such as clusters and supercomputers, without any requirement for special hard- 
ware or operating systems. OmniRPC provides uniform access to a variety of 
remote computing resources. 

At the beginning of execution, the initialization of OmniRPC collects the in- 
formation about remote libraries registered in available remote computers. Om- 
niRPC automatically allocates remote library calls dynamically on appropriate 
remote computers to facilitate location transparency. In order to support paral- 
lel programming, the multi-threaded client can issue multiple requests by Om- 
niRPC simultaneously. Each outstanding request is dispatched to a different re- 
mote computer to exploit network-wide parallelism. Although the POSIX thread 
library can be used for programming multi-threaded clients, multi-threaded pro- 
gramming using thread library directly makes the client program complicated. 

While the OpenMP Application Programming Interface (API) is proposed 
for parallel programming on shared-memory multiprocessors, OpenMP provides 
a multi-threaded programming model without the complexity of multi-threaded 
programming. We have developed Omni OpenMP compiler ^], which is a free 
and open-source, portable implementation of OpenMP. In a shared memory 
multiprocessor, threads in OpenMP are eventually bound to physical processors 
for efficient parallel execution. The POSIX thread implementation of the Omni 
OpenMP compiler allows multi-threaded execution of OpenMP programs by 
POSIX threads even in a single processor. 
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OpenMP provides an easy-to-use and simple programming environment for 
the multi-threaded client of OmniRPC. Figure Q illustrates the OpenMP multi- 
threaded client and OmniRPCs. A typical application in cluster and grid envi- 
ronments is parametric execution which executes the same code with different 
input parameters. For this type of applications, we can use OpenMP parallel loop 
directives to execute OmniRPC calls in parallel for different remote computers. 

For a computational grid environment, OmniRPC uses Globus toolkit^ as 
a grid software infrastructure. Although a Globus implementation of MPICH, 
MPICH-G, can be used for parallel programming in Globus, message passing 
programming requires programmers to explicitly code the communication and 
makes writing parallel programs cumbersome. While our proposed model is lim- 
ited to a master-slave model, it provides very simple parallel programming en- 
vironment for a computational grid. 

The parallel programming model with the OpenMP client of OmniRPC can 
be applied to other RPC facilities such as CORBA if these APIs are thread-safe. 
NetSolvePI is a similar RPC facility to our OmniRPC and Ninf. It also provides a 
programming interface similar to ours and automatic load balancing mechanism 
by a agent. To our best knowledge, no experience of parallel programming with 
OpenMP is reported. 

2 OmniRPC: A Thread-Safe Remote Procedure Call 
Facility 

A client and the remote computational nodes which execute remote procedures 
may be connected via a local area network or over a wide-area network. A client 
and nodes may be heterogeneous: data in communication is translated into the 
common network data format. 

The remote libraries are implemented as executable programs which contain 
network stub routine as its main routine, and registered in the registry file in 
each remote nodes. We call such executable programs Ninf executables (pro- 
grams). These stubs are generated from the interface descriptions by the Ninf 
IDL compiler. 

In a client node, a user prepares his own machine file which contains the host 
names of available computation nodes. The OmniRPC initialization function, 
DmniRPC_init, reads registry files in the remote nodes to make the database 
which associates the entry names of remote functions with Ninf executables. 

OmniRPC inherits its API and basic architecture from Ninf. 
DmniRPC_Call 0 is the sole client interface to call the remote library. In 
order to illustrate the programming interface with an example, let us consider 
a simple matrix multiply routine in C programs with the following interface: 

double A[N] [N] ,B[N] [N] ,C[N] [N] ; /* declaration */ 

dmmul(A,B,C,N) ; /* calls matrix multiply, C = A * B */ 

When the dmmul routine is available in a remote node, the client program can 
call the remote library using QmniRPC_Call, in the following manner: 
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OmniRPC_Call("dmmul" , A,B,C,N) ; /* call remote library */ 

Here, dmmul is the entry name of library registered as a Ninf executable on a 
remote node, and A,B,C,N are the same arguments. As we see here, the client 
user only needs to specify the name of the function as if he were making a local 
function call; DmniRPC_Call () automatically determines the function arity and 
the type of each argument, appropriately marshals the arguments, makes the 
remote call to the remote node, obtains the results, places the results in the 
appropriate argument, and returns to the client. In this way, the OmniRPC is 
designed to give the users an illusion that arguments are shared between the 
client and the remote nodes. 

To realize such simplicity in the client programming interface, a client re- 
mote function call obtains all the interface information regarding the called li- 
brary function at runtime from the server. The interface information includes 
the number of parameters, these types and sizes and access mode of arguments 
(read/write). Using these informations, the RPC automatically performs ar- 
gument marshaling, and generates the sequence of sending and receiving data 
from/to the nodes. This design is in contrast to traditional RPCs, where stub 
generation is done on the client side at compile time. 

The interface to a remote function is described in Ninf IDL. For example, 
the interface description for the matrix multiply given above is: 

Define dmmul(long mode_in int n, mode_in double A[n][n], 
mode_in double B [n] [n] , mode_out double C [n] [n] ) 

"... description ..." 

Required "libxxx.o" /* specify library including this 

routine. */ 

Calls "C" dmmul(n,A,B,C) ; /* Use C calling convention. */ 

where the access specifiers , mode_in and mode_out, specify whether the argu- 
ment is read or written. To specify the size of each argument, the other in_mode 
arguments can be used to form a size expression. In this example, the value 
of n is referenced to calculate the size of the array arguments A, B, C. Since it 
is designed for numerical applications, the supported data type in Ninf IDL is 
tailored for such a purpose; for example, the data types are limited to scalars 
and their multi-dimensional arrays. The interface description is compiled by the 
Ninf interface generator to generate a stub program for each library function. 
The interface generator also automatically outputs a makefile with which the 
Ninf executables can be created by linking the stub programs and library func- 
tions. 

To invoke a Ninf executable in a remote node, OmniRPC use the remote shell 
command “rsh” in a local area network and GRAM(Globus Resource Allocation 
Manager) API of Globus toolkit in a grid environment. The Ninf executable 
is invoked with the arguments of the client host name and port number for 
waiting the connection. For a grid environment, we designed OmniRPC on top 
of the Globus toolkit. The Globus I/O module is used in a grid environment 
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instead of TCP/IP in local area network. The Globus also provides security and 
authentication by GSI (Globus Security Infrastructure). 

To handle multiple outstanding RPC requests from a multi-threaded client, 
OmniRPC maintains the queue for outstanding remote procedure calls. 
DmniRPC_Call 0 enqueues the request for the remote call, and blocked for wait- 
ing the return from the remote call. The scheduler thread is created to manage 
the queue. For the requested call in the queue, it searches the database of the 
remote function entries to schedule the requests to the remote nodes. When 
the results are sent back, the scheduler thread receives the results and stores it 
into the output argument of the call. Then, it resumes the waiting thread. The 
current implementation uses a simple round-robin scheduling. The machine file 
contains the maximum number of jobs as well as the list of host names. When all 
remote nodes are busy and the number of jobs reaches to the limit, the thread 
executing DmniRPC_Call () is blocked. As soon as the jobs for requested remote 
call is over, the next request is scheduled if any waiting requests exist in the 
queue. 

3 OpenMP Client Using OmniRPC 

Since OmniRPG is thread-safe, multiple remote procedure calls can be out- 
standing simultaneously from multi-threaded programs written in OpenMP. 

A typical application of OmniRPG in OpenMP is to execute the same pro- 
cedure over different input arguments as follows: 

QmniRPC_init 0 ; /* initialize RPC */ 

#pragma omp parallel for 

ford = 0; i < N; i++) 

DmniRPC_Call ( "work" , i > • • • ) ; 

In this loop, the remote function work are executed in parallel with different 
arguments i in the remote nodes. 

The procedure-level task-parallelism is also described as in the following code: 

#pragma omp parallel sections 

{ 

#pragma omp section 
DmniRPC_Call("subA") ; 

#pragma omp section 
DmniRPC_Call("subB") ; 

#pragma omp section 
DmniRPC_Call("subC") ; 

} 

The OpenMP clients of OmniRPG can be executed in a single processor. In 
our Omni OpenMP compiler, we need to execute OpenMP program containing 
OmniRPG calls in the following environment: 
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— Set the environment variable OMP_SCHEDULE to "static,!", meaning cyclic 
scheduling with chunk size 1. In the compiler, the default loop scheduling is 
block-scheduling which may cause load imbalance when the execution time 
of each remote call changes. 

— Set the environment variable OMP_NUM_THREADS to the number greater than 
the total number of jobs in available remote nodes. The large numbers of 
threads are needed to issue the remote procedure call simultaneously for 
many remote nodes. Furthermore, multiple requests to the remote nodes may 
hide the latency of communication and the overhead of executable invocation 
by the local scheduler in remote nodes. 

— Compile with “mutex-lock” configuration. As default, the Omni OpenMP 
compiler uses the spin-lock for fast synchronization in a multiprocessor. It, 
however, sometimes delays the context-switch between threads in a single 
processor. The mutex-lock configuration uses the mutex lock of the POSIX 
thread library for better operating system scheduling. 

In the recent release OpenMP 2.0, there is a new clause, NUM_THREADS to 
parallel region directives. This clause requests that a specific number of threads 
are used in the regions. This also works for nested regions with task-parallelism 
and loop-parallelism as in the following code: 



#pragma omp parallel sections num_threads(3) 

{ 



#pragma section 
#pragma omp parallel for 
ford = 0; i < N; i++) 
#pragma section 
#pragma omp parallel for 
forCj = 0; j < N; j++) 
#pragma section 
worked ; 

} 



num_threads (10) 
QmniRPC_Call("workA" , i) ; 

num_threads (20) 
QmniRPC_Call("workB" , j) ; 



4 Current Status and Future Work 

In this paper, we proposed a parallel programming model using the thread-safe 
OmniRPC in an OpenMP client program for a cluster and global computing. The 
programmer can build a global computing system by using the remote libraries 
as its components, without being aware of complexities and hassles of network 
programming. OpenMP provides an ease-to-use and simple programming envi- 
ronment for a multi-threaded client of OmniRPC. Currently, we have finished 
a preliminary implementation of OmniRPC. We are doing several experiments 
and evaluations on some parametric search applications. 

The current implementation employs a simple round-robin scheduling over 
available remote nodes. In the grid environment, the computation time of RPCs 
may be greatly influenced by many factors including computational ability of 
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the nodes, the distance to the nodes with respect to the bandwidth of commu- 
nication, and the status of the nodes. More sophisticated scheduling using such 
dynamic information reported by the local job scheduler in the remote node will 
be required for efficient remote execution. 

We are also developing a remote executable management tool for OmniRPC, 
which sends the IDL of remote functions to generate stubs in remote nodes 
automatically. It allows the user to install remote libraries without complex and 
time-consuming install procedure of remote executables for many remote nodes. 
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